Abstract

This chapter introduces some concentration inequalities for discrete-time martingales with bounded increments, and it
exemplifies some of their potential applications in information theory and related topics. The first part of this chapter
briefly introduces discrete-time martingales and the Azuma-Hoeffding and McDiarmid inequalities, which are widely used
in this context. It then derives refined versions of these inequalities, followed by a discussion of their relations to
some classical results in probability theory. It also considers a geometric interpretation of some of these inequalities,
providing insight into the interconnections between them. The second part exemplifies the use of these refined
inequalities in the context of hypothesis testing, information theory, communications, and coding. The chapter is
concluded with a discussion of some directions for further research.

Index Terms 

Concentration of measures, error exponents, Fisher information, hypothesis testing, information divergence, large deviations, 
martingales, moderate deviations principle. 
I. Introduction

Inequalities providing upper bounds on probabilities of the type $\mathbb{P}(|X - \bar{x}| \ge t)$ (or $\mathbb{P}(X - \bar{x} \ge t)$) for a
random variable (RV) $X$, where $\bar{x}$ denotes the expectation or median of $X$, have been among the main tools
of probability theory. These inequalities are known as concentration inequalities, and they have been subject to
interesting developments in probability theory. Very roughly speaking, the concentration of measure phenomenon
can be stated in the following simple way: "A random variable that depends in a smooth way on many independent
random variables (but not too much on any of them) is essentially constant" [75]. The exact meaning of such a
statement clearly needs to be clarified rigorously, but it will often mean that such a random variable $X$ concentrates
around $\bar{x}$ in a way that the probability of the event $\{|X - \bar{x}| \ge t\}$ decays exponentially in $t$ (for $t \ge 0$). The
foundations in concentration of measures have been introduced, e.g., in [3, Chapter 7], [15, Chapter 2], [16], [42],
[47], [48, Chapter 5], [50], [74] and [75]. Concentration inequalities are also at the core of probabilistic analysis
of randomized algorithms (see, e.g., [3], [23], [50] and [61]).

The Chernoff bounds provide sharp concentration inequalities when the considered RV X can be expressed as a 
sum of n independent and bounded RVs. However, the situation is clearly more complex for non-product measures 
where the concentration property may not exist. Several techniques have been developed to prove concentration of 
measures. Among several methodologies, these include Talagrand's concentration inequalities for product measures 
(e.g., [74] and [75] with some information-theoretic applications in [40] and [41]), logarithmic-Sobolev inequalities 
(e.g., [23, Chapter 14], [42, Chapter 5] and [47] with information-theoretic aspects in [37], [38]), transportation-cost 
inequalities which originated from information theory (e.g., [23, Chapters 12, 13] and [42, Chapter 6]), and the 
martingale approach (e.g., [3, Chapter 7], [50] with information-theoretic aspects in, e.g., [45], [60], [61], [81]). 
This chapter mainly considers the last methodology, focusing on discrete-time martingales with bounded jumps. 

The Azuma-Hoeffding inequality is by now a well-known methodology that has been often used to prove 
concentration phenomena for discrete-time martingales whose jumps are bounded almost surely. It is due to 
Hoeffding [34] who proved this inequality for $X = \sum_{i=1}^n X_i$ where $\{X_i\}$ are independent and bounded RVs, and
Azuma [7] later extended it to bounded-difference martingales. It is noted that the Azuma-Hoeffding inequality for 
a bounded martingale-difference sequence was extended to centering sequences with bounded differences [51]; this 
extension provides sharper concentration results for, e.g., sequences that are related to sampling without replacement.
Some relative entropy and exponential deviation bounds were derived in [39] for an important class of Markov 
chains, and these bounds are essentially identical to the Hoeffding inequality in the special case of i.i.d. RVs. A 
common method for proving concentration of a function $f : \mathbb{R}^n \to \mathbb{R}$ of $n$ independent RVs, around the expected
value $\mathbb{E}[f]$, where the function $f$ is characterized by bounded differences whenever the $n$-dimensional vectors differ
in only one coordinate, is called McDiarmid's inequality or the 'independent bounded differences inequality' (see 
[50, Theorem 3.1]). This inequality was proved (with some possible extensions) via the martingale approach (see 
[50, Section 3.5]). Although the proof of this inequality has some similarity to the proof of the Azuma-Hoeffding 
inequality, the former inequality is stated under a condition which provides an improvement by a factor of 4 in the 
exponent. Some of its nice applications to algorithmic discrete mathematics were exemplified in [50, Section 3]. 

The use of the Azuma-Hoeffding inequality was introduced to the computer science literature in [70] in order to 
prove concentration, around the expected value, of the chromatic number for random graphs. The chromatic number 
of a graph is defined to be the minimal number of colors that is required to color all the vertices of this graph so that 
no two vertices which are connected by an edge have the same color, and the ensemble for which concentration was 
demonstrated in [70] was the ensemble of random graphs with n vertices such that any ordered pair of vertices in the 
graph is connected by an edge with a fixed probability $p$ for some $p \in (0, 1)$. It is noted that the concentration result
in [70] was established without knowing the expected value over this ensemble. The migration of this bounding 
inequality into coding theory, especially for exploring some concentration phenomena that are related to the analysis 
of codes defined on graphs and iterative message-passing decoding algorithms, was initiated in [45], [60] and [72].
During the last decade, the Azuma-Hoeffding inequality has been extensively used for proving concentration of 
measures in coding theory (see, e.g., [61, Appendix C] and references therein). In general, all these concentration 
inequalities serve to justify theoretically the ensemble approach of codes defined on graphs. However, much stronger 
concentration phenomena are observed in practice. The Azuma-Hoeffding inequality was also recently used in [77] 
for the analysis of probability estimation in the rare-events regime where it was assumed that an observed string 
is drawn i.i.d. from an unknown distribution, but the alphabet size and the source distribution both scale with the 
block length (so the empirical distribution does not converge to the true distribution as the block length tends to 
infinity). In [80], the Azuma-Hoeffding inequality was used to derive achievable rates and random coding error 
exponents for non-linear additive white Gaussian noise channels. This analysis was followed by another recent 
work of the same authors [81] who used some other concentration inequalities, for discrete-parameter martingales
with bounded jumps, to derive achievable rates and random coding error exponents for non-linear Volterra channels 
(where their bounding technique can be also applied to intersymbol-interference (ISI) channels, as noted in [81]). 
This direction of research was further studied in [82], and improved achievable rates have been derived via refined
versions of the Azuma-Hoeffding inequality.

This chapter is structured as follows: Section II briefly presents discrete-time (sub/super) martingales. Section III
presents the Azuma-Hoeffding inequality and McDiarmid's inequality; these are widely used in proving concentration,
and their derivation relies on the martingale approach. Section IV derives some refined versions of the
Azuma-Hoeffding inequality, and it considers interconnections between these bounds. Section V considers some
connections between the concentration inequalities that are introduced in Section IV and the method of types, a
central limit theorem for martingales, the law of the iterated logarithm, the moderate deviations principle for i.i.d.
real-valued random variables, and some previously reported concentration inequalities for discrete-parameter martingales
with bounded jumps. Section VI forms the second part of this work, applying the concentration inequalities from
Section IV to information theory and some related topics. This chapter is summarized in Section VII, followed 
by a discussion on some topics, mainly related to information theory and coding, for further research. Various 
mathematical details of the analysis are relegated to the appendices. This work is meant to stimulate the derivation 
of some new refined versions of concentration inequalities for martingales with a further consideration of their 
possible applications in aspects that are related to information theory, communications and coding. 

In connection to the presentation in this chapter, the reader is referred to [3, Chapter 11], [15, Chapter 2], [16] 
and [50] as some additional surveys on concentration inequalities for (sub/super) martingales.
II. Discrete-Time Martingales

A. Martingales 

This sub-section provides a short background on martingales to set definitions and notation (the reader is referred, 
e.g., to [78] for a nice exposition of discrete-time martingales). We will not use any result about martingales beyond 
the definition and a few basic properties that will be mentioned explicitly.

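Definition 1: [Martingale] Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space. A martingale sequence is a sequence $X_0, X_1, \ldots$
of random variables (RVs) and corresponding sub $\sigma$-algebras $\mathcal{F}_0, \mathcal{F}_1, \ldots$ that satisfy the following conditions:

1) $X_i \in L^1(\Omega, \mathcal{F}_i, \mathbb{P})$ for every $i$, i.e., each $X_i$ is defined on the same sample space $\Omega$, it is measurable with
respect to the $\sigma$-algebra $\mathcal{F}_i$ (i.e., $X_i$ is $\mathcal{F}_i$-measurable) and $\mathbb{E}[|X_i|] = \int_{\Omega} |X_i(\omega)| \, d\mathbb{P}(\omega) < \infty$.

2) $\mathcal{F}_0 \subseteq \mathcal{F}_1 \subseteq \ldots$ (this sequence is called a filtration).

3) For all $i \in \mathbb{N}$, $X_{i-1} = \mathbb{E}[X_i \,|\, \mathcal{F}_{i-1}]$ almost surely (a.s.).

In this case, it is written that $\{X_i, \mathcal{F}_i\}_{i=0}^{\infty}$ or $\{X_i, \mathcal{F}_i\}_{i \in \mathbb{N}_0}$ (with $\mathbb{N}_0 = \mathbb{N} \cup \{0\}$) is a martingale sequence (the
inclusion of $X_{\infty}$ and $\mathcal{F}_{\infty}$ in the martingale is not required here).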

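Remark 1: Since $\{\mathcal{F}_i\}_{i=0}^{\infty}$ forms a filtration, it follows from the tower principle for conditional expectations
that a.s.

$$X_j = \mathbb{E}[X_i \,|\, \mathcal{F}_j], \quad \forall\, i > j.$$

Also, for every $i \in \mathbb{N}$, $\mathbb{E}[X_i] = \mathbb{E}\bigl[\mathbb{E}[X_i \,|\, \mathcal{F}_{i-1}]\bigr] = \mathbb{E}[X_{i-1}]$, so the expectation of a martingale sequence stays
constant.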

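Remark 2: One can generate martingale sequences by the following procedure: Given a RV $X \in L^1(\Omega, \mathcal{F}, \mathbb{P})$
and an arbitrary filtration of sub $\sigma$-algebras $\{\mathcal{F}_i\}_{i=0}^{\infty}$, let

$$X_i = \mathbb{E}[X \,|\, \mathcal{F}_i], \quad \forall\, i \in \{0, 1, \ldots\}.$$

Then, the sequence $X_0, X_1, \ldots$ forms a martingale since

1) The RV $X_i = \mathbb{E}[X \,|\, \mathcal{F}_i]$ is $\mathcal{F}_i$-measurable, and also $\mathbb{E}[|X_i|] \le \mathbb{E}[|X|] < \infty$ (since conditioning reduces the
expectation of the absolute value).

2) By construction $\{\mathcal{F}_i\}_{i=0}^{\infty}$ is a filtration.

3) For every $i \in \mathbb{N}$

$$\mathbb{E}[X_i \,|\, \mathcal{F}_{i-1}] = \mathbb{E}\bigl[\mathbb{E}[X \,|\, \mathcal{F}_i] \,\big|\, \mathcal{F}_{i-1}\bigr] = \mathbb{E}[X \,|\, \mathcal{F}_{i-1}] \quad (\text{since } \mathcal{F}_{i-1} \subseteq \mathcal{F}_i) \quad = X_{i-1} \quad \text{a.s.}$$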

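Remark 3: In continuation to Remark 2, one can choose $\mathcal{F}_0 = \{\emptyset, \Omega\}$ and $\mathcal{F}_n = \mathcal{F}$, so that $X_0, X_1, \ldots, X_n$ is
a martingale sequence where

$$X_0 = \mathbb{E}[X \,|\, \mathcal{F}_0] = \mathbb{E}[X] \quad (\text{since } X \text{ is independent of } \mathcal{F}_0)$$
$$X_n = \mathbb{E}[X \,|\, \mathcal{F}_n] = X \quad \text{a.s.} \quad (\text{since } X \text{ is } \mathcal{F}\text{-measurable}).$$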

In this case, one gets a martingale sequence where the first element is the expected value of X, and the last element 
of the sequence is X itself (a.s.). This has the following interpretation: At the beginning, one doesn't know anything 
about X, so it is initially estimated by its expectation. At each step more and more information about X is revealed 
until one is able to specify it exactly (a.s.). 
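
The construction in Remarks 2 and 3 can be illustrated numerically. The following is a minimal sketch, assuming $n$ independent fair bits and a hypothetical test function $f$ (the number of runs in the bit string); the names `count_runs` and `doob_martingale` are introduced here for illustration only. It computes $X_k = \mathbb{E}[f \,|\, \mathcal{F}_k]$ by exhaustive enumeration and checks the martingale property of Definition 1.

```python
from itertools import product

def count_runs(bits):
    """Hypothetical test function f: number of maximal runs of equal bits."""
    return 1 + sum(1 for a, b in zip(bits, bits[1:]) if a != b)

def doob_martingale(f, n, prefix):
    """X_k = E[f(B_1..B_n) | B_1..B_k = prefix] for i.i.d. fair bits (assumption)."""
    k = len(prefix)
    tails = list(product((0, 1), repeat=n - k))
    return sum(f(tuple(prefix) + t) for t in tails) / len(tails)

n = 6
x0 = doob_martingale(count_runs, n, ())   # X_0 = E[f], nothing revealed yet

# Martingale property: averaging X_k over the k-th bit returns X_{k-1}.
max_gap = 0.0
for k in range(1, n + 1):
    for prefix in product((0, 1), repeat=k - 1):
        prev = doob_martingale(count_runs, n, prefix)
        nxt = 0.5 * sum(doob_martingale(count_runs, n, prefix + (b,)) for b in (0, 1))
        max_gap = max(max_gap, abs(prev - nxt))
```

Here `x0` plays the role of $X_0 = \mathbb{E}[X]$, and calling `doob_martingale` with a full-length prefix returns $f$ itself, which corresponds to $X_n = X$ in Remark 3.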

B. Sub/Super Martingales

Sub and super martingales require the first two conditions in Definition 1, and the equality in the third condition
of Definition 1 is relaxed to one of the following inequalities:

• $\mathbb{E}[X_i \,|\, \mathcal{F}_{i-1}] \ge X_{i-1}$ holds a.s. for sub-martingales.

• $\mathbb{E}[X_i \,|\, \mathcal{F}_{i-1}] \le X_{i-1}$ holds a.s. for super-martingales.

Clearly, every process that is both a sub and super-martingale is a martingale. Furthermore, $\{X_i, \mathcal{F}_i\}$ is a
sub-martingale if and only if $\{-X_i, \mathcal{F}_i\}$ is a super-martingale. The following properties are direct consequences of
Jensen's inequality for conditional expectations:

• If $\{X_i, \mathcal{F}_i\}$ is a martingale, $h$ is a convex (concave) function and $\mathbb{E}[|h(X_i)|] < \infty$, then $\{h(X_i), \mathcal{F}_i\}$ is a sub
(super) martingale.

• If $\{X_i, \mathcal{F}_i\}$ is a super-martingale, $h$ is monotonic increasing and concave, and $\mathbb{E}[|h(X_i)|] < \infty$, then $\{h(X_i), \mathcal{F}_i\}$
is a super-martingale. Similarly, if $\{X_i, \mathcal{F}_i\}$ is a sub-martingale, $h$ is monotonic increasing and convex, and
$\mathbb{E}[|h(X_i)|] < \infty$, then $\{h(X_i), \mathcal{F}_i\}$ is a sub-martingale.

III. Two Basic Concentration Inequalities 

In this section, we prove two basic inequalities that are widely used for proving concentration results.
Their proofs convey the main concepts of the martingale approach for proving concentration. Their
presentation also motivates some refinements that are considered later in this chapter, followed by some applications.

A. The Azuma-Hoeffding Inequality 

The Azuma-Hoeffding inequality$^1$ is a useful concentration inequality for bounded-difference martingales. It was
proved in [34] for independent bounded random variables, followed by a discussion on sums of dependent random 
variables; this inequality was later derived in [7] for the more general setting of bounded-difference martingales. 
In the following, this inequality is introduced. 

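Theorem 1: [Azuma-Hoeffding inequality] Let $\{X_k, \mathcal{F}_k\}_{k=0}^{\infty}$ be a discrete-parameter real-valued martingale
sequence such that for every $k \in \mathbb{N}$, the condition $|X_k - X_{k-1}| \le d_k$ holds a.s. for some non-negative constants
$\{d_k\}_{k=1}^{\infty}$. Then, for every $n \in \mathbb{N}$ and $\alpha > 0$,

$$\mathbb{P}(|X_n - X_0| \ge \alpha) \le 2 \exp\left(-\frac{\alpha^2}{2 \sum_{k=1}^n d_k^2}\right). \tag{1}$$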



The proof of the Azuma-Hoeffding inequality also serves to present the basic principles on which the martingale
approach for proving concentration results is based. Therefore, we present in the following the proof of this
inequality.

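Proof: For an arbitrary $\alpha > 0$,

$$\mathbb{P}(|X_n - X_0| \ge \alpha) = \mathbb{P}(X_n - X_0 \ge \alpha) + \mathbb{P}(X_n - X_0 \le -\alpha). \tag{2}$$

Let $\xi_i \triangleq X_i - X_{i-1}$ for $i = 1, \ldots, n$ designate the jumps of the martingale. Then, it follows by assumption that
$|\xi_k| \le d_k$ and $\mathbb{E}[\xi_k \,|\, \mathcal{F}_{k-1}] = 0$ a.s. for every $k \in \{1, \ldots, n\}$.
From Chernoff's inequality, for every $t \ge 0$,

$$\mathbb{P}(X_n - X_0 \ge \alpha) = \mathbb{P}\left(\sum_{k=1}^n \xi_k \ge \alpha\right) \le e^{-\alpha t} \, \mathbb{E}\left[\exp\left(t \sum_{k=1}^n \xi_k\right)\right]. \tag{3}$$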



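For every $t \ge 0$,

$$
\begin{aligned}
\mathbb{E}\left[\exp\left(t \sum_{k=1}^n \xi_k\right)\right]
&= \mathbb{E}\left[\mathbb{E}\left[\exp\left(t \sum_{k=1}^n \xi_k\right) \,\Big|\, \mathcal{F}_{n-1}\right]\right] \\
&= \mathbb{E}\left[\mathbb{E}\left[\exp\left(t \sum_{k=1}^{n-1} \xi_k\right) \exp(t \xi_n) \,\Big|\, \mathcal{F}_{n-1}\right]\right] \\
&= \mathbb{E}\left[\exp\left(t \sum_{k=1}^{n-1} \xi_k\right) \mathbb{E}\bigl[\exp(t \xi_n) \,|\, \mathcal{F}_{n-1}\bigr]\right]
\end{aligned}
\tag{4}
$$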



$^1$The Azuma-Hoeffding inequality is also known as Azuma's inequality. Since it is referred to numerous times in this
chapter, it will be named Azuma's inequality for the sake of brevity.
where the last transition holds since $Y \triangleq \exp\bigl(t \sum_{k=1}^{n-1} \xi_k\bigr)$ is $\mathcal{F}_{n-1}$-measurable. The measurability of $Y$ is due
to the fact that $\xi_k = X_k - X_{k-1}$ is $\mathcal{F}_k$-measurable for every $k \in \mathbb{N}$, and $\mathcal{F}_k \subseteq \mathcal{F}_{n-1}$ for $0 \le k \le n-1$ since
$\{\mathcal{F}_k\}_{k=0}^{n}$ is a filtration; hence, the RV $\sum_{k=1}^{n-1} \xi_k$ and its exponentiation ($Y$) are both $\mathcal{F}_{n-1}$-measurable, and a.s.
$\mathbb{E}[XY \,|\, \mathcal{F}_{n-1}] = Y \, \mathbb{E}[X \,|\, \mathcal{F}_{n-1}]$.

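Due to the convexity of the exponential function, and since $|\xi_k| \le d_k$, the straight line connecting the end
points of the exponential function lies above this function over the interval $[-d_k, d_k]$, so a.s. for every $k$

$$
\begin{aligned}
\mathbb{E}\bigl[e^{t \xi_k} \,|\, \mathcal{F}_{k-1}\bigr]
&\le \mathbb{E}\left[\frac{(d_k + \xi_k) e^{t d_k} + (d_k - \xi_k) e^{-t d_k}}{2 d_k} \,\Big|\, \mathcal{F}_{k-1}\right] \\
&= \frac{1}{2} \bigl(e^{t d_k} + e^{-t d_k}\bigr) \\
&= \cosh(t d_k).
\end{aligned}
\tag{5}
$$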



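Since, for every integer $m \ge 0$,

$$(2m)! \ge (2m)(2m-2) \cdots 2 = 2^m \, m!$$

then due to the power series expansion of the hyperbolic cosine and exponential functions, we have

$$\cosh(t d_k) = \sum_{m=0}^{\infty} \frac{(t d_k)^{2m}}{(2m)!} \le \sum_{m=0}^{\infty} \frac{(t d_k)^{2m}}{2^m \, m!} = e^{\frac{t^2 d_k^2}{2}}$$

which therefore implies that

$$\mathbb{E}\bigl[e^{t \xi_k} \,|\, \mathcal{F}_{k-1}\bigr] \le e^{\frac{t^2 d_k^2}{2}}.$$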

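Hence, by repeatedly using the recursion in (4), it follows that

$$\mathbb{E}\left[\exp\left(t \sum_{k=1}^n \xi_k\right)\right] \le \prod_{k=1}^n \exp\left(\frac{t^2 d_k^2}{2}\right) = \exp\left(\frac{t^2}{2} \sum_{k=1}^n d_k^2\right) \tag{6}$$

which then gives from (3) that, for every $t \ge 0$,

$$\mathbb{P}(X_n - X_0 \ge \alpha) \le \exp\left(-\alpha t + \frac{t^2}{2} \sum_{k=1}^n d_k^2\right). \tag{7}$$

An optimization over the free parameter $t \ge 0$ gives that

$$\mathbb{P}(X_n - X_0 \ge \alpha) \le \exp\left(-\frac{\alpha^2}{2 \sum_{k=1}^n d_k^2}\right). \tag{8}$$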

Since, by assumption, $\{X_k, \mathcal{F}_k\}$ is a martingale with bounded jumps, so is $\{-X_k, \mathcal{F}_k\}$ (with the same bounds on
its jumps). This implies that the same bound is also valid for the probability $\mathbb{P}(X_n - X_0 \le -\alpha)$ and together with
(2) it completes the proof of the Azuma-Hoeffding inequality. ■

The proof of the Azuma-Hoeffding inequality will be revisited later in this chapter for the derivation of some 
refined versions of Azuma's inequality, whose use and advantage will be also exemplified. 
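
The two analytic steps of the proof above (the bound $\cosh(x) \le e^{x^2/2}$ and the optimization over $t$) can be checked numerically. The following is a minimal sketch under illustrative assumptions: the jump bounds in `d` and the value of `alpha` are arbitrary, and the helper names `azuma_bound` and `chernoff_exponent` are hypothetical.

```python
import math

def azuma_bound(alpha, d):
    """Two-sided Azuma-Hoeffding bound (1) for jump bounds d = [d_1, ..., d_n]."""
    return 2.0 * math.exp(-alpha**2 / (2.0 * sum(dk**2 for dk in d)))

def chernoff_exponent(t, alpha, d):
    """Exponent -alpha*t + (t^2/2) * sum d_k^2 that is minimized over t >= 0."""
    return -alpha * t + 0.5 * t**2 * sum(dk**2 for dk in d)

# Step 1: cosh(x) <= exp(x^2/2) on a grid (small relative slack for rounding).
grid = [i / 100.0 for i in range(-500, 501)]
cosh_ok = all(math.cosh(x) <= math.exp(x * x / 2.0) * (1 + 1e-12) for x in grid)

# Step 2: the minimizer over t >= 0 is t* = alpha / sum d_k^2.
d = [1.0, 0.5, 2.0, 1.5]   # illustrative jump bounds (assumption)
alpha = 3.0
t_star = alpha / sum(dk**2 for dk in d)
best_on_grid = min(chernoff_exponent(i / 1000.0, alpha, d) for i in range(0, 5001))
one_sided = math.exp(chernoff_exponent(t_star, alpha, d))
```

Doubling `one_sided` reproduces the right-hand side of (1), which accounts for the two tails in (2).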

Remark 4: In [50, Theorem 3.13], Azuma's inequality is stated as follows: Let $\{Y_k, \mathcal{F}_k\}_{k=0}^{\infty}$ be a martingale-difference
sequence with $Y_0 = 0$ (i.e., $Y_k$ is $\mathcal{F}_k$-measurable, $\mathbb{E}[|Y_k|] < \infty$ and $\mathbb{E}[Y_k \,|\, \mathcal{F}_{k-1}] = 0$ a.s. for every $k \in \mathbb{N}$).
Assume that, for every $k \in \mathbb{N}$, there exist numbers $a_k, b_k \in \mathbb{R}$ such that a.s. $a_k \le Y_k \le b_k$. Then, for every $r \ge 0$,

$$\mathbb{P}\left(\left|\sum_{k=1}^n Y_k\right| \ge r\right) \le 2 \exp\left(-\frac{2 r^2}{\sum_{k=1}^n (b_k - a_k)^2}\right). \tag{9}$$

Hence, consider a discrete-parameter real-valued martingale sequence $\{X_k, \mathcal{F}_k\}_{k=0}^{\infty}$ where $a_k \le X_k - X_{k-1} \le b_k$
a.s. for every $k \in \mathbb{N}$. Let $Y_k \triangleq X_k - X_{k-1}$ for every $k \in \mathbb{N}$. This implies that $\{Y_k, \mathcal{F}_k\}_{k=0}^{\infty}$ is a martingale-difference
sequence. From (9), it follows that for every $r \ge 0$,

$$\mathbb{P}(|X_n - X_0| \ge r) \le 2 \exp\left(-\frac{2 r^2}{\sum_{k=1}^n (b_k - a_k)^2}\right). \tag{10}$$

According to the setting in Theorem 1, $a_k = -d_k$ and $b_k = d_k$ for every $k \in \mathbb{N}$, which implies the equivalence
between (1) and (10).

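As a special case of Theorem 1, let $\{X_k, \mathcal{F}_k\}_{k=0}^{\infty}$ be a martingale sequence, and assume that there exists a
constant $d > 0$ such that a.s., for every $k \in \mathbb{N}$, $|X_k - X_{k-1}| \le d$. Then, for every $n \in \mathbb{N}$ and $\alpha \ge 0$,

$$\mathbb{P}(|X_n - X_0| \ge \alpha \sqrt{n}) \le 2 \exp\left(-\frac{\alpha^2}{2 d^2}\right). \tag{11}$$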

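Example 1: Let $\{Y_i\}_{i=0}^{\infty}$ be i.i.d. binary random variables which get the values $\pm d$, for some constant $d > 0$,
with equal probability. Let $X_k = \sum_{i=0}^k Y_i$ for $k \in \{0, 1, \ldots\}$, and define the natural filtration $\mathcal{F}_0 \subseteq \mathcal{F}_1 \subseteq \mathcal{F}_2 \ldots$
where

$$\mathcal{F}_k = \sigma(Y_0, \ldots, Y_k), \quad \forall\, k \in \{0, 1, \ldots\}$$

is the $\sigma$-algebra that is generated by the random variables $Y_0, \ldots, Y_k$. Note that $\{X_k, \mathcal{F}_k\}_{k=0}^{\infty}$ is a martingale
sequence, and (a.s.) $|X_k - X_{k-1}| = |Y_k| = d$, $\forall\, k \in \mathbb{N}$. It therefore follows from Azuma's inequality in (11) that

$$\mathbb{P}(|X_n - X_0| \ge \alpha \sqrt{n}) \le 2 \exp\left(-\frac{\alpha^2}{2 d^2}\right) \tag{12}$$

for every $\alpha \ge 0$ and $n \in \mathbb{N}$. From the central limit theorem (CLT), since the RVs $\{Y_i\}_{i=0}^{\infty}$ are i.i.d. with zero mean
and variance $d^2$, then $\frac{1}{\sqrt{n}}(X_n - X_0) = \frac{1}{\sqrt{n}} \sum_{k=1}^n Y_k$ converges in distribution to $\mathcal{N}(0, d^2)$. Therefore, for every
$\alpha \ge 0$,

$$\lim_{n \to \infty} \mathbb{P}(|X_n - X_0| \ge \alpha \sqrt{n}) = 2 \, Q\left(\frac{\alpha}{d}\right) \tag{13}$$

where

$$Q(x) \triangleq \frac{1}{\sqrt{2\pi}} \int_x^{\infty} \exp\left(-\frac{t^2}{2}\right) dt, \quad \forall\, x \in \mathbb{R} \tag{14}$$

is the probability that a zero-mean and unit-variance Gaussian RV is larger than $x$. Since the following exponential
upper and lower bounds on the Q-function hold

$$\frac{1}{\sqrt{2\pi}} \cdot \frac{x}{1 + x^2} \cdot e^{-\frac{x^2}{2}} < Q(x) < \frac{1}{\sqrt{2\pi} \, x} \cdot e^{-\frac{x^2}{2}}, \quad \forall\, x > 0 \tag{15}$$

then it follows from (13) that the exponent on the right-hand side of (12) is the exact exponent in this example.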
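
The comparison in Example 1 can be reproduced numerically. The following is a minimal sketch, not part of the original analysis: it computes the exact two-sided tail of the $\pm d$ random walk from binomial coefficients and places it next to the Azuma bound (12) and the CLT limit (13). The function names and the values of `alpha`, `d` and `n` are illustrative assumptions.

```python
import math

def q_function(x):
    """Gaussian tail Q(x) = P(N(0,1) > x), via the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def q_bounds(x):
    """Lower and upper bounds on Q(x) from (15), valid for x > 0."""
    g = math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)
    return g * x / (1.0 + x * x), g / x

def exact_two_sided_tail(n, alpha, d=1.0):
    """P(|Y_1 + ... + Y_n| >= alpha*sqrt(n)) for i.i.d. Y_k = +/- d with probability 1/2."""
    threshold = alpha * math.sqrt(n)
    total = 0
    for heads in range(n + 1):
        if abs(d * (2 * heads - n)) >= threshold:
            total += math.comb(n, heads)
    return total / 2**n

def azuma_example1(alpha, d=1.0):
    """Right-hand side of (12)."""
    return 2.0 * math.exp(-alpha**2 / (2.0 * d**2))

alpha, d = 2.0, 1.0
for n in (10, 100, 1000):
    print(n, exact_two_sided_tail(n, alpha, d), azuma_example1(alpha, d),
          2 * q_function(alpha / d))
```

The printed rows list, for each $n$, the exact probability, the bound (12), and the limit $2Q(\alpha/d)$ from (13).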

Example 2: In continuation to Example 1, let $\gamma \in (0, 1]$, and let us generalize this example by considering the
case where the i.i.d. binary RVs $\{Y_i\}_{i=0}^{\infty}$ have the probability law

$$\mathbb{P}(Y_i = +d) = \frac{\gamma}{1 + \gamma}, \quad \mathbb{P}(Y_i = -\gamma d) = \frac{1}{1 + \gamma}.$$

Hence, it follows that the i.i.d. RVs $\{Y_i\}$ have zero mean, as in Example 1, and variance $\sigma^2 = \gamma d^2$. Let $\{X_k, \mathcal{F}_k\}_{k=0}^{\infty}$
be defined similarly to Example 1, so that it forms a martingale sequence. Based on the CLT, $\frac{1}{\sqrt{n}}(X_n - X_0) =
\frac{1}{\sqrt{n}} \sum_{k=1}^n Y_k$ converges weakly to $\mathcal{N}(0, \gamma d^2)$, so for every $\alpha \ge 0$

$$\lim_{n \to \infty} \mathbb{P}(|X_n - X_0| \ge \alpha \sqrt{n}) = 2 \, Q\left(\frac{\alpha}{\sqrt{\gamma} \, d}\right). \tag{16}$$

From the exponential upper and lower bounds of the Q-function in (15), the right-hand side of (16) scales
exponentially like $e^{-\frac{\alpha^2}{2 \gamma d^2}}$. Hence, the exponent in this example is improved by a factor $\frac{1}{\gamma}$ as compared to Azuma's
inequality (that is the same as in Example 1 since $|X_k - X_{k-1}| \le d$ for every $k \in \mathbb{N}$). This indicates a possible
refinement of Azuma's inequality by introducing an additional constraint on the second moment. This route was
studied extensively in the probability literature, and it is further studied in Section IV.

Example 2 serves to motivate the introduction of an additional constraint on the conditional variance of a
martingale sequence, i.e., adding an inequality constraint of the form

$$\mathrm{Var}(X_k \,|\, \mathcal{F}_{k-1}) = \mathbb{E}\bigl[(X_k - X_{k-1})^2 \,|\, \mathcal{F}_{k-1}\bigr] \le \gamma d^2$$

where $\gamma \in (0, 1]$ is a constant. Note that since, by assumption, $|X_k - X_{k-1}| \le d$ a.s. for every $k \in \mathbb{N}$, the
additional constraint becomes active when $\gamma < 1$ (i.e., if $\gamma = 1$, then this additional constraint is redundant, and it
coincides with the setting of Azuma's inequality with a fixed $d_k$, i.e., $d_k = d$).
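
The moments claimed in Example 2 and the resulting gain in the exponent can be verified with exact rational arithmetic. The following is a small sketch; the function names and the particular values of `gamma`, `d` and `alpha` are assumptions chosen for illustration.

```python
from fractions import Fraction

def example2_moments(gamma, d):
    """Mean and variance of Y with P(Y=+d)=gamma/(1+gamma), P(Y=-gamma*d)=1/(1+gamma)."""
    p_up = gamma / (1 + gamma)
    p_down = 1 / (1 + gamma)
    mean = p_up * d + p_down * (-gamma * d)
    var = p_up * d**2 + p_down * (gamma * d)**2 - mean**2
    return mean, var

def exponents(alpha, gamma, d):
    """Azuma exponent alpha^2/(2 d^2) versus the CLT exponent alpha^2/(2 gamma d^2)."""
    azuma = alpha**2 / (2 * d**2)
    clt = alpha**2 / (2 * gamma * d**2)
    return azuma, clt

gamma, d, alpha = Fraction(1, 4), Fraction(2), Fraction(3)
mean, var = example2_moments(gamma, d)
azuma_exp, clt_exp = exponents(alpha, gamma, d)
```

With these values the ratio `clt_exp / azuma_exp` equals $1/\gamma$, which is the improvement factor discussed in Example 2.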
B. McDiarmid's Inequality

The following useful inequality is due to McDiarmid ([49] or [51, Theorem 3.1]), and its original derivation
uses the martingale approach. We will relate, in the following, the derivation of this inequality to
the derivation of the Azuma-Hoeffding inequality (see the previous sub-section).

Theorem 2: [McDiarmid's inequality] Let $\{X_i\}$ be independent real-valued random variables (not necessarily
i.i.d.), and assume that $X_i : \Omega_i \to \mathbb{R}$ for every $i$. Let $\{\hat{X}_i\}_{i=1}^n$ be independent copies of $\{X_i\}_{i=1}^n$, respectively, and
suppose that, for every $k \in \{1, \ldots, n\}$,

$$\bigl|g(X_1, \ldots, X_{k-1}, X_k, X_{k+1}, \ldots, X_n) - g(X_1, \ldots, X_{k-1}, \hat{X}_k, X_{k+1}, \ldots, X_n)\bigr| \le d_k \tag{17}$$

holds a.s. (note that a stronger condition would be to require that the variation of $g$ w.r.t. the $k$-th coordinate of
$x \in \mathbb{R}^n$ is upper bounded by $d_k$, i.e.,

$$\sup |g(x) - g(x')| \le d_k$$

for every $x, x' \in \mathbb{R}^n$ that differ only in their $k$-th coordinate.) Then, for every $\alpha \ge 0$,

$$\mathbb{P}\bigl(\bigl|g(X_1, \ldots, X_n) - \mathbb{E}[g(X_1, \ldots, X_n)]\bigr| \ge \alpha\bigr) \le 2 \exp\left(-\frac{2 \alpha^2}{\sum_{k=1}^n d_k^2}\right). \tag{18}$$

Remark 5: As we will see from the proof of this inequality, one could use the Azuma-Hoeffding inequality for
proving it, but then the exponent would be four times smaller (i.e., the factor 2 in the exponent would have appeared
in the denominator instead of appearing in the numerator). Hence, it will be observed from the proof that in the
current setting, one gets a gain of a factor of 4 in the exponent.
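
The factor of 4 in Remark 5 can be seen on a simple instance. The following is a minimal sketch, assuming that $g$ is the empirical mean of $n$ fair Bernoulli variables, so that $d_k = 1/n$; the function names and the parameter values are illustrative assumptions rather than part of the original text.

```python
import math

def mcdiarmid_bound(alpha, d):
    """Right-hand side of (18): 2 exp(-2 alpha^2 / sum d_k^2)."""
    return 2.0 * math.exp(-2.0 * alpha**2 / sum(dk**2 for dk in d))

def azuma_bound(alpha, d):
    """Right-hand side of (1): 2 exp(-alpha^2 / (2 sum d_k^2))."""
    return 2.0 * math.exp(-alpha**2 / (2.0 * sum(dk**2 for dk in d)))

def exact_mean_tail(n, alpha):
    """P(|mean - 1/2| >= alpha) for the mean of n fair Bernoulli variables."""
    hits = sum(math.comb(n, h) for h in range(n + 1) if abs(h / n - 0.5) >= alpha)
    return hits / 2**n

n, alpha = 200, 0.125
d = [1.0 / n] * n   # changing one coordinate moves the empirical mean by at most 1/n
print(exact_mean_tail(n, alpha), mcdiarmid_bound(alpha, d), azuma_bound(alpha, d))
```

The three printed values are the exact tail probability, the bound (18), and the bound (1) for the same constants $d_k$.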

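Proof: For $k \in \{1, \ldots, n\}$, let $\mathcal{F}_k = \sigma(X_1, \ldots, X_k)$ be the $\sigma$-algebra that is generated by $X_1, \ldots, X_k$ with
$\mathcal{F}_0 = \{\emptyset, \Omega\}$ being the minimal $\sigma$-algebra. Define

$$\xi_k \triangleq \mathbb{E}\bigl[g(X_1, \ldots, X_n) \,|\, \mathcal{F}_k\bigr] - \mathbb{E}\bigl[g(X_1, \ldots, X_n) \,|\, \mathcal{F}_{k-1}\bigr], \quad \forall\, k \in \{1, \ldots, n\}. \tag{19}$$

Note that $\mathcal{F}_0 \subseteq \mathcal{F}_1 \ldots \subseteq \mathcal{F}_n$ is a filtration, and

$$\mathbb{E}\bigl[g(X_1, \ldots, X_n) \,|\, \mathcal{F}_0\bigr] = \mathbb{E}\bigl[g(X_1, \ldots, X_n)\bigr], \quad \mathbb{E}\bigl[g(X_1, \ldots, X_n) \,|\, \mathcal{F}_n\bigr] = g(X_1, \ldots, X_n). \tag{20}$$

Hence

$$
\begin{aligned}
g(X_1, \ldots, X_n) - \mathbb{E}[g(X_1, \ldots, X_n)]
&= \mathbb{E}\bigl[g(X_1, \ldots, X_n) \,|\, \mathcal{F}_n\bigr] - \mathbb{E}\bigl[g(X_1, \ldots, X_n) \,|\, \mathcal{F}_0\bigr] \\
&= \sum_{k=1}^n \Bigl\{ \mathbb{E}\bigl[g(X_1, \ldots, X_n) \,|\, \mathcal{F}_k\bigr] - \mathbb{E}\bigl[g(X_1, \ldots, X_n) \,|\, \mathcal{F}_{k-1}\bigr] \Bigr\} \\
&= \sum_{k=1}^n \xi_k.
\end{aligned}
\tag{21}
$$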

In the following, we need the next lemma:

Lemma 1: For every $k \in \{1, \ldots, n\}$, the following properties hold a.s.:

1) $\mathbb{E}[\xi_k \,|\, \mathcal{F}_{k-1}] = 0$, so $\{\xi_k, \mathcal{F}_k\}$ is a martingale-difference and $\xi_k$ is $\mathcal{F}_k$-measurable.

2) $|\xi_k| \le d_k$.

3) $\xi_k \in [a_k, a_k + d_k]$ where $a_k$ is some non-positive $\mathcal{F}_{k-1}$-measurable random variable.

Proof: The random variable $\xi_k$ is $\mathcal{F}_k$-measurable since $\mathcal{F}_{k-1} \subseteq \mathcal{F}_k$, and $\xi_k$ is a difference of two functions where
one is $\mathcal{F}_k$-measurable and the other one is $\mathcal{F}_{k-1}$-measurable. Furthermore, it is easy to verify that $\mathbb{E}[\xi_k \,|\, \mathcal{F}_{k-1}] = 0$.
This verifies the first item; the second item follows from the third item. To prove the third item, let

$$\xi_k = \mathbb{E}\bigl[g(X_1, \ldots, X_{k-1}, X_k, X_{k+1}, \ldots, X_n) \,|\, \mathcal{F}_k\bigr] - \mathbb{E}\bigl[g(X_1, \ldots, X_k, X_{k+1}, \ldots, X_n) \,|\, \mathcal{F}_{k-1}\bigr]$$

$$\hat{\xi}_k = \mathbb{E}\bigl[g(X_1, \ldots, X_{k-1}, \hat{X}_k, X_{k+1}, \ldots, X_n) \,|\, \hat{\mathcal{F}}_k\bigr] - \mathbb{E}\bigl[g(X_1, \ldots, X_k, X_{k+1}, \ldots, X_n) \,|\, \mathcal{F}_{k-1}\bigr]$$

where $\{\hat{X}_i\}_{i=1}^n$ is an independent copy of $\{X_i\}_{i=1}^n$, and we define

$$\hat{\mathcal{F}}_k = \sigma(X_1, \ldots, X_{k-1}, \hat{X}_k).$$

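Due to the independence of $X_k$ and $\hat{X}_k$, and since they are also independent of the other RVs, then a.s.

$$
\begin{aligned}
|\xi_k - \hat{\xi}_k|
&= \Bigl| \mathbb{E}\bigl[g(X_1, \ldots, X_k, \ldots, X_n) \,|\, \mathcal{F}_k\bigr] - \mathbb{E}\bigl[g(X_1, \ldots, X_{k-1}, \hat{X}_k, X_{k+1}, \ldots, X_n) \,|\, \hat{\mathcal{F}}_k\bigr] \Bigr| \\
&= \Bigl| \mathbb{E}\bigl[g(X_1, \ldots, X_k, \ldots, X_n) - g(X_1, \ldots, \hat{X}_k, \ldots, X_n) \,\big|\, \sigma(X_1, \ldots, X_{k-1}, X_k, \hat{X}_k)\bigr] \Bigr| \\
&\le \mathbb{E}\Bigl[ \bigl|g(X_1, \ldots, X_{k-1}, X_k, X_{k+1}, \ldots, X_n) - g(X_1, \ldots, X_{k-1}, \hat{X}_k, X_{k+1}, \ldots, X_n)\bigr| \,\Big|\, \sigma(X_1, \ldots, X_{k-1}, X_k, \hat{X}_k) \Bigr] \\
&\le d_k.
\end{aligned}
\tag{22}
$$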

Therefore, $|\xi_k - \hat{\xi}_k| \le d_k$ holds a.s. for every pair of independent copies $X_k$ and $\hat{X}_k$, which are also independent
of the other random variables. This implies that $\xi_k$ is a.s. supported on an interval $[a_k, a_k + d_k]$ for some function
$a_k = a_k(X_1, \ldots, X_{k-1})$ that is $\mathcal{F}_{k-1}$-measurable (since $X_k$ and $\hat{X}_k$ are independent copies, and $\xi_k - \hat{\xi}_k$ is a
difference of $g(X_1, \ldots, X_{k-1}, X_k, \ldots, X_n)$ and $g(X_1, \ldots, X_{k-1}, \hat{X}_k, \ldots, X_n)$, then this is in essence saying that if
a set $S \subseteq \mathbb{R}$ has the property that the distance between any two of its points is not larger than some $d > 0$, then the
set should be included in an interval whose length is $d$). Since also $\mathbb{E}[\xi_k \,|\, \mathcal{F}_{k-1}] = 0$ then a.s. the $\mathcal{F}_{k-1}$-measurable
function $a_k$ is non-positive. It is noted that the third item of the lemma is what makes it different from the proof of
the Azuma-Hoeffding inequality (which, in that case, implies that $\xi_k \in [-d_k, d_k]$ where the length of the interval
is twice as large, i.e., $2 d_k$). ■

Let $b_k \triangleq a_k + d_k$. Since $\mathbb{E}[\xi_k \,|\, \mathcal{F}_{k-1}] = 0$ and $\xi_k \in [a_k, b_k]$ with $a_k \le 0$, and $a_k$ and $b_k$ are $\mathcal{F}_{k-1}$-measurable, then

$$\mathrm{Var}(\xi_k \,|\, \mathcal{F}_{k-1}) \le -a_k b_k \triangleq \sigma_k^2.$$

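Applying the convexity of the exponential function gives (similarly to the derivation of the Azuma-Hoeffding
inequality, but this time w.r.t. the interval $[a_k, b_k]$ whose length is only $d_k$) that, for every $k \in \{1, \ldots, n\}$,

$$\mathbb{E}\bigl[e^{t \xi_k} \,|\, \mathcal{F}_{k-1}\bigr] \le \frac{b_k e^{t a_k} - a_k e^{t b_k}}{d_k}. \tag{23}$$

Let $p_k \triangleq -\frac{a_k}{d_k} \in [0, 1]$, then

$$
\begin{aligned}
\mathbb{E}\bigl[e^{t \xi_k} \,|\, \mathcal{F}_{k-1}\bigr]
&\le p_k e^{t b_k} + (1 - p_k) e^{t a_k} \\
&= e^{t a_k} \bigl[1 - p_k + p_k e^{t d_k}\bigr] \\
&= e^{f_k(t)}
\end{aligned}
\tag{24}
$$

where $f_k(t) \triangleq t a_k + \ln\bigl(1 - p_k + p_k e^{t d_k}\bigr)$ for $t \in \mathbb{R}$. Since $f_k(0) = f_k'(0) = 0$ and the geometric mean is less than
or equal to the arithmetic mean then, for every $t$,

$$f_k''(t) = \frac{d_k^2 \, p_k (1 - p_k) e^{t d_k}}{\bigl(1 - p_k + p_k e^{t d_k}\bigr)^2} \le \frac{d_k^2}{4}$$

which implies by Taylor's theorem that

$$f_k(t) \le \frac{t^2 d_k^2}{8}$$

so, from (24),

$$\mathbb{E}\bigl[e^{t \xi_k} \,|\, \mathcal{F}_{k-1}\bigr] \le e^{\frac{t^2 d_k^2}{8}}.$$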

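Similarly to the proof of the Azuma-Hoeffding inequality, by repeatedly using the recursion in (4), the last inequality
implies that

$$\mathbb{E}\left[\exp\left(t \sum_{k=1}^n \xi_k\right)\right] \le \exp\left(\frac{t^2}{8} \sum_{k=1}^n d_k^2\right) \tag{25}$$

which then gives from (3) that, for every $t \ge 0$,

$$
\begin{aligned}
\mathbb{P}\bigl(g(X_1, \ldots, X_n) - \mathbb{E}[g(X_1, \ldots, X_n)] \ge \alpha\bigr)
&= \mathbb{P}\left(\sum_{k=1}^n \xi_k \ge \alpha\right) \\
&\le \exp\left(-\alpha t + \frac{t^2}{8} \sum_{k=1}^n d_k^2\right).
\end{aligned}
\tag{26}
$$

An optimization over the free parameter $t \ge 0$ gives that

$$\mathbb{P}\bigl(g(X_1, \ldots, X_n) - \mathbb{E}[g(X_1, \ldots, X_n)] \ge \alpha\bigr) \le \exp\left(-\frac{2 \alpha^2}{\sum_{k=1}^n d_k^2}\right). \tag{27}$$

Similarly to the derivation of the Azuma-Hoeffding inequality, this bound is also valid for the probability

$$\mathbb{P}\bigl(g(X_1, \ldots, X_n) - \mathbb{E}[g(X_1, \ldots, X_n)] \le -\alpha\bigr),$$

which therefore gives the bound in (18). ■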
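
The step that yields the improved constant in the proof above is the pair of bounds $f_k''(t) \le d_k^2/4$ and $f_k(t) \le t^2 d_k^2/8$. The following sketch checks both on a grid of values of $a_k \in [-d_k, 0]$ and $t$; the grid, the value of `d`, and the function names are illustrative assumptions.

```python
import math

def f_k(t, a, d):
    """f_k(t) = t*a + ln(1 - p + p*exp(t*d)) with p = -a/d, for a in [-d, 0]."""
    p = -a / d
    return t * a + math.log(1.0 - p + p * math.exp(t * d))

def f_k_second(t, a, d):
    """Second derivative of f_k with respect to t."""
    p = -a / d
    e = math.exp(t * d)
    return d**2 * p * (1.0 - p) * e / (1.0 - p + p * e) ** 2

d = 1.5
worst_curv, worst_gap = 0.0, -math.inf
for i in range(0, 21):            # a ranges over [-d, 0]
    a = -d * i / 20.0
    for j in range(-400, 401):    # t ranges over [-4, 4]
        t = j / 100.0
        worst_curv = max(worst_curv, f_k_second(t, a, d))
        worst_gap = max(worst_gap, f_k(t, a, d) - t**2 * d**2 / 8.0)
```

On this grid `worst_curv` is attained at $p_k = 1/2$ and $t = 0$, where the arithmetic-geometric mean inequality holds with equality.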

IV. Refined Versions of the Azuma-Hoeffding Inequality
A. First Refinement of Azuma's Inequality

The following theorem appears in [49] and [21, Corollary 2.4.7]. 

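Theorem 3: Let $\{X_k, \mathcal{F}_k\}_{k=0}^{\infty}$ be a discrete-parameter real-valued martingale. Assume that, for some constants
$d, \sigma > 0$, the following two requirements are satisfied a.s.

$$|X_k - X_{k-1}| \le d,$$

$$\mathrm{Var}(X_k \,|\, \mathcal{F}_{k-1}) = \mathbb{E}\bigl[(X_k - X_{k-1})^2 \,|\, \mathcal{F}_{k-1}\bigr] \le \sigma^2$$

for every $k \in \{1, \ldots, n\}$. Then, for every $\alpha \ge 0$,

$$\mathbb{P}(|X_n - X_0| \ge \alpha n) \le 2 \exp\left(-n \, D\left(\frac{\delta + \gamma}{1 + \gamma} \,\Big\|\, \frac{\gamma}{1 + \gamma}\right)\right) \tag{28}$$

where

$$\gamma \triangleq \frac{\sigma^2}{d^2}, \quad \delta \triangleq \frac{\alpha}{d} \tag{29}$$

and

$$D(p \| q) \triangleq p \ln\left(\frac{p}{q}\right) + (1 - p) \ln\left(\frac{1 - p}{1 - q}\right), \quad \forall\, p, q \in [0, 1] \tag{30}$$

is the divergence (a.k.a. relative entropy or Kullback-Leibler distance) between the two probability distributions
$(p, 1 - p)$ and $(q, 1 - q)$. If $\delta > 1$, then the probability on the left-hand side of (28) is equal to zero.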
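
For concreteness, the bound of Theorem 3 can be evaluated and compared with Azuma's inequality for the same deviation $\alpha n$. The following is a minimal sketch; the function names and the numerical values are assumptions made for illustration, and the convention $0 \ln 0 = 0$ is used in the divergence.

```python
import math

def binary_divergence(p, q):
    """D(p||q) in nats between (p, 1-p) and (q, 1-q), with the convention 0*ln(0) = 0."""
    def term(x, y):
        if x == 0.0:
            return 0.0
        if y == 0.0:
            return math.inf
        return x * math.log(x / y)
    return term(p, q) + term(1.0 - p, 1.0 - q)

def theorem3_bound(alpha, n, d, sigma2):
    """Right-hand side of (28); returns 0 when delta > 1."""
    gamma = sigma2 / d**2
    delta = alpha / d
    if delta > 1.0:
        return 0.0
    expo = binary_divergence((delta + gamma) / (1.0 + gamma), gamma / (1.0 + gamma))
    return 2.0 * math.exp(-n * expo)

def azuma_bound_linear(alpha, n, d):
    """Azuma's bound (1) with d_k = d and deviation alpha*n: 2 exp(-n alpha^2 / (2 d^2))."""
    return 2.0 * math.exp(-n * alpha**2 / (2.0 * d**2))

n, d, alpha = 100, 1.0, 0.2
for sigma2 in (1.0, 0.25, 0.05):
    print(sigma2, theorem3_bound(alpha, n, d, sigma2), azuma_bound_linear(alpha, n, d))
```

The printed rows show how the bound (28) tightens as the variance constraint $\sigma^2$ decreases, while Azuma's bound does not depend on $\sigma^2$.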

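Remark 6: From the above conditions, without any loss of generality, $\sigma^2 \le d^2$ and therefore $\gamma \in (0, 1]$.

Proof: The proof of this bound starts similarly to the proof of the Azuma-Hoeffding inequality, up to (4).
The new ingredient in this proof is Bennett's inequality which replaces the argument of the convexity of the
exponential function in the proof of the Azuma-Hoeffding inequality. From Bennett's inequality [10] (see, e.g., [21,
Lemma 2.4.1]), if $X$ is a real-valued random variable with $\bar{x} = \mathbb{E}(X)$ and $\mathbb{E}[(X - \bar{x})^2] \le \sigma^2$ for some $\sigma > 0$, and
$X \le b$ a.s. for some $b \in \mathbb{R}$, then for every $\lambda \ge 0$

$$\mathbb{E}\bigl[e^{\lambda X}\bigr] \le \frac{e^{\lambda \bar{x}} \left[ (b - \bar{x})^2 \exp\left(-\frac{\lambda \sigma^2}{b - \bar{x}}\right) + \sigma^2 e^{\lambda (b - \bar{x})} \right]}{(b - \bar{x})^2 + \sigma^2}. \tag{31}$$

Applying Bennett's inequality for the conditional law of $\xi_k$ given the $\sigma$-algebra $\mathcal{F}_{k-1}$, since $\mathbb{E}[\xi_k \,|\, \mathcal{F}_{k-1}] = 0$,
$\mathrm{Var}[\xi_k \,|\, \mathcal{F}_{k-1}] \le \sigma^2$ and $\xi_k \le d$ a.s. for $k \in \mathbb{N}$, then a.s.

$$\mathbb{E}\bigl[\exp(t \xi_k) \,|\, \mathcal{F}_{k-1}\bigr] \le \frac{\sigma^2 \exp(t d) + d^2 \exp\left(-\frac{t \sigma^2}{d}\right)}{d^2 + \sigma^2}. \tag{32}$$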
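Hence, it follows from (4) and (32) that, for every $t \ge 0$,

$$\mathbb{E}\left[\exp\left(t \sum_{k=1}^n \xi_k\right)\right] \le \left(\frac{\sigma^2 \exp(t d) + d^2 \exp\left(-\frac{t \sigma^2}{d}\right)}{d^2 + \sigma^2}\right) \mathbb{E}\left[\exp\left(t \sum_{k=1}^{n-1} \xi_k\right)\right]$$

and, by induction, it follows that for every $t \ge 0$

$$\mathbb{E}\left[\exp\left(t \sum_{k=1}^n \xi_k\right)\right] \le \left(\frac{\sigma^2 \exp(t d) + d^2 \exp\left(-\frac{t \sigma^2}{d}\right)}{d^2 + \sigma^2}\right)^n.$$

From the definition of $\gamma$ in (29), this inequality is rewritten as

$$\mathbb{E}\left[\exp\left(t \sum_{k=1}^n \xi_k\right)\right] \le \left(\frac{\gamma \exp(t d) + \exp(-\gamma t d)}{1 + \gamma}\right)^n, \quad \forall\, t \ge 0. \tag{33}$$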



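Let $x \triangleq t d$ (so $x \ge 0$). Combining Chernoff's inequality with (33) gives that, for every $\alpha \ge 0$ (where from the
definition of $\delta$ in (29), $\alpha t = \delta x$),

$$
\begin{aligned}
\mathbb{P}(X_n - X_0 \ge \alpha n)
&\le \exp(-\alpha n t) \, \mathbb{E}\left[\exp\left(t \sum_{k=1}^n \xi_k\right)\right] \\
&\le \left(\frac{\gamma \exp\bigl((1 - \delta) x\bigr) + \exp\bigl(-(\gamma + \delta) x\bigr)}{1 + \gamma}\right)^n, \quad \forall\, x \ge 0.
\end{aligned}
\tag{34}
$$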



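Consider first the case where $\delta = 1$ (i.e., $\alpha = d$), then (34) is particularized to

$$\mathbb{P}(X_n - X_0 \ge d n) \le \left(\frac{\gamma + \exp\bigl(-(\gamma + 1) x\bigr)}{1 + \gamma}\right)^n, \quad \forall\, x \ge 0$$

and the tightest bound within this form is obtained in the limit where $x \to \infty$. This provides the inequality

$$\mathbb{P}(X_n - X_0 \ge d n) \le \left(\frac{\gamma}{1 + \gamma}\right)^n. \tag{35}$$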



Otherwise, if $\delta \in [0, 1)$, the minimization of the base of the exponent on the right-hand side of (34) w.r.t. the free
non-negative parameter $x$ yields that the optimized value is

$$x = \left(\frac{1}{1 + \gamma}\right) \ln\left(\frac{\gamma + \delta}{\gamma (1 - \delta)}\right) \tag{36}$$



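and its substitution into the right-hand side of (34) gives that, for every $\alpha \ge 0$,

$$
\begin{aligned}
\mathbb{P}(X_n - X_0 \ge \alpha n)
&\le \left[\left(\frac{\gamma + \delta}{\gamma}\right)^{-\frac{\gamma + \delta}{1 + \gamma}} (1 - \delta)^{-\frac{1 - \delta}{1 + \gamma}}\right]^n \\
&= \exp\left(-n \, D\left(\frac{\delta + \gamma}{1 + \gamma} \,\Big\|\, \frac{\gamma}{1 + \gamma}\right)\right)
\end{aligned}
\tag{37}
$$

and the exponent is equal to $+\infty$ if $\delta > 1$ (i.e., if $\alpha > d$). Applying inequality (37) to the martingale $\{-X_k, \mathcal{F}_k\}_{k=0}^{\infty}$
gives the same upper bound to the other tail-probability $\mathbb{P}(X_n - X_0 \le -\alpha n)$. The probability of the union of the
two disjoint events $\{X_n - X_0 \ge \alpha n\}$ and $\{X_n - X_0 \le -\alpha n\}$, that is equal to the sum of their probabilities,
therefore satisfies the upper bound in (28). This completes the proof of Theorem 3. ■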
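
The optimization that leads from (34) to (37) can be checked numerically. The following sketch, with illustrative values of $\gamma$ and $\delta$ and hypothetical function names, evaluates the base of (34) at the optimizer (36) and compares it with $\exp(-D(\cdot\|\cdot))$ and with a grid search over $x \ge 0$.

```python
import math

def base_34(x, gamma, delta):
    """Base of the n-th power on the right-hand side of (34)."""
    return (gamma * math.exp((1.0 - delta) * x)
            + math.exp(-(gamma + delta) * x)) / (1.0 + gamma)

def x_opt(gamma, delta):
    """Optimizer (36), valid for delta in [0, 1)."""
    return math.log((gamma + delta) / (gamma * (1.0 - delta))) / (1.0 + gamma)

def divergence(p, q):
    """Binary divergence D(p||q) in nats for p, q in (0, 1)."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

gamma, delta = 0.3, 0.4   # illustrative values (assumption)
x_star = x_opt(gamma, delta)
closed_form = math.exp(-divergence((delta + gamma) / (1.0 + gamma), gamma / (1.0 + gamma)))
grid_min = min(base_34(i / 1000.0, gamma, delta) for i in range(0, 10001))
```

Raising `closed_form` to the power $n$ gives the one-sided bound in (37).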
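Example 3: Let $d > 0$ and $\varepsilon \in (0, \frac{1}{2}]$ be some constants. Consider a discrete-time real-valued martingale
$\{X_k, \mathcal{F}_k\}_{k=0}^{\infty}$ where a.s. $X_0 = 0$, and for every $m \in \mathbb{N}$

$$\mathbb{P}(X_m - X_{m-1} = d \,|\, \mathcal{F}_{m-1}) = \varepsilon, \quad \mathbb{P}\left(X_m - X_{m-1} = -\frac{\varepsilon d}{1 - \varepsilon} \,\Big|\, \mathcal{F}_{m-1}\right) = 1 - \varepsilon.$$

This indeed implies that a.s. for every $m \in \mathbb{N}$

$$\mathbb{E}[X_m - X_{m-1} \,|\, \mathcal{F}_{m-1}] = \varepsilon d + \left(-\frac{\varepsilon d}{1 - \varepsilon}\right) (1 - \varepsilon) = 0$$

and since $X_{m-1}$ is $\mathcal{F}_{m-1}$-measurable then a.s.

$$\mathbb{E}[X_m \,|\, \mathcal{F}_{m-1}] = X_{m-1}.$$

Since $\varepsilon \in (0, \frac{1}{2}]$ then a.s.

$$|X_m - X_{m-1}| \le \max\left\{d, \frac{\varepsilon d}{1 - \varepsilon}\right\} = d.$$

From Azuma's inequality, for every $x \ge 0$,

$$\mathbb{P}(X_k \ge k x) \le \exp\left(-\frac{k x^2}{2 d^2}\right) \tag{38}$$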


independently of the value of e (note that Xq = a.s.). The concentration inequality in Theorem 3 enables one to 
get a better bound: Since a.s., for every m G N, 



E[(X ri 
then from (29) 

and from (37), for every x > 0, 



X„ 



m—l 



) 2 I T 



m-l 



7 



d 2 e + 



ed 
l-e 

x 



(1-e) 



d 2 e 
1 - e 



5 = 



d 



P(X fc > kx) < exp 



{-" D ( 



x(l — e) 
d 



+ e 



'))■ 



(39) 



Consider the case where e — > 0. Then, for arbitrary x > and fc G N, Azuma's inequality in (38) provides an upper 
bound that is strictly positive independently of e, whereas the one-sided concentration inequality of Theorem 3 
implies a bound in (39) that tends to zero. This exemplifies the improvement that is obtained by Theorem 3 in 
comparison to Azuma's inequality. 
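The behavior described in Example 3 can be illustrated numerically. The sketch below is only an illustration under assumed parameter values ($d$, $x$, $k$ and the list of $\varepsilon$ values are hypothetical choices); it evaluates the right-hand sides of (38) and (39) as $\varepsilon$ decreases.

```python
import math

def kl_bernoulli(p, q):
    """Binary divergence D(p||q) in nats, with the convention 0*ln(0) = 0."""
    total = 0.0
    if p > 0:
        total += p * math.log(p / q)
    if p < 1:
        total += (1 - p) * math.log((1 - p) / (1 - q))
    return total

def azuma_bound(k, x, d):
    """Right-hand side of (38); it does not depend on epsilon."""
    return math.exp(-k * x ** 2 / (2 * d ** 2))

def theorem3_bound(k, x, d, eps):
    """Right-hand side of (39) for the martingale of Example 3."""
    p = x * (1 - eps) / d + eps
    if p > 1:
        return 0.0  # the exponent is +infinity when delta = x/d > 1
    return math.exp(-k * kl_bernoulli(p, eps))

# Hypothetical parameter values, chosen only for illustration.
d, x, k = 1.0, 0.5, 10
eps_values = (0.5, 0.1, 1e-3, 1e-6)
for eps in eps_values:
    print(f"eps={eps:g}  Azuma={azuma_bound(k, x, d):.3e}  "
          f"Theorem 3={theorem3_bound(k, x, d, eps):.3e}")
```

The Azuma column stays fixed, while the Theorem 3 column shrinks as $\varepsilon$ decreases.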

Remark 7: As was noted, e.g., in [50, Section 2], all the concentration inequalities for martingales whose derivation is based on Chernoff's bound can be strengthened to refer to maxima. The reason is that $\{X_k - X_0, \mathcal{F}_k\}_{k=0}^{\infty}$ is a martingale, and $h(x) = \exp(tx)$ is a convex function on $\mathbb{R}$ for every $t \ge 0$. Recall that a composition of a convex function with a martingale gives a sub-martingale w.r.t. the same filtration (see Section II-B), so it implies that $\{\exp(t(X_k - X_0)), \mathcal{F}_k\}_{k=0}^{\infty}$ is a sub-martingale for every $t \ge 0$. Hence, by applying Doob's maximal inequality for sub-martingales, it follows that for every $\alpha \ge 0$

\[
\begin{aligned}
\mathbb{P}\Bigl( \max_{1 \le k \le n} X_k - X_0 \ge \alpha n \Bigr)
&= \mathbb{P}\Bigl( \max_{1 \le k \le n} \exp\bigl(t(X_k - X_0)\bigr) \ge \exp(\alpha n t) \Bigr), \quad t \ge 0 \\
&\le \exp(-\alpha n t) \, \mathbb{E}\bigl[ \exp\bigl(t(X_n - X_0)\bigr) \bigr] \\
&= \exp(-\alpha n t) \, \mathbb{E}\Bigl[ \exp\Bigl( t \sum_{k=1}^{n} \xi_k \Bigr) \Bigr]
\end{aligned}
\]

which coincides with the proof of Theorem 3 with the starting point in (3). This concept applies to all the concentration inequalities derived in this chapter.

Corollary 1: In the setting of Theorem 3, for every $\alpha \ge 0$,

\[ \mathbb{P}(|X_n - X_0| \ge \alpha n) \le 2 \exp \left( -2n \left( \frac{\delta}{1+\gamma} \right)^2 \right). \tag{40} \]

Proof: This concentration inequality is a loosened version of Theorem 3. From Pinsker's inequality,

\[ D(p \| q) \ge \frac{V^2}{2}, \quad \forall p, q \in [0, 1] \tag{41} \]

where

\[ V \triangleq \| (p, 1-p) - (q, 1-q) \|_1 = 2 |p - q| \tag{42} \]

denotes the $L^1$-variational distance between the two probability distributions. Hence, for $\gamma, \delta \in [0, 1]$

\[ D \left( \frac{\delta+\gamma}{1+\gamma} \,\Big\|\, \frac{\gamma}{1+\gamma} \right) \ge 2 \left( \frac{\delta}{1+\gamma} \right)^2. \]

■

Remark 8: As was shown in the proof of Corollary 1, the loosening of the exponential bound in Theorem 3 by using Pinsker's inequality gives inequality (40). Note that (40) forms a generalization of Azuma's inequality in Theorem 1 for the special case where, for every $i$, $d_i \triangleq d$ for some $d > 0$. Inequality (40) is particularized to Azuma's inequality when $\gamma = 1$, and then

\[ \mathbb{P}(|X_n - X_0| \ge \alpha n) \le 2 \exp \left( -\frac{n \delta^2}{2} \right). \tag{43} \]

This is consistent with the observation that if $\gamma = 1$ then, from (29), the requirement in Theorem 3 for the conditional variance of the bounded-difference martingale sequence becomes redundant (since if $|X_k - X_{k-1}| \le d$ a.s. then also $\mathbb{E}[(X_k - X_{k-1})^2 \mid \mathcal{F}_{k-1}] \le d^2$). Hence, if $\gamma = 1$, the concentration inequality in Theorem 3 is derived under the same setting as that of Azuma's inequality.

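Corollary 2: Let $\{X_k, \mathcal{F}_k\}_{k=0}^{\infty}$ be a discrete-parameter real-valued martingale, and assume that for some constant $d > 0$

\[ |X_k - X_{k-1}| \le d \]

a.s. for every $k \in \{1, \ldots, n\}$. Then, for every $\alpha \ge 0$,

\[ \mathbb{P}(|X_n - X_0| \ge \alpha n) \le 2 \exp\bigl( -n f(\delta) \bigr) \tag{44} \]

where

\[ f(\delta) = \begin{cases} \ln(2) \left[ 1 - h_2 \left( \frac{1-\delta}{2} \right) \right], & 0 \le \delta \le 1 \\ +\infty, & \delta > 1 \end{cases} \tag{45} \]

and $h_2(x) \triangleq -x \log_2(x) - (1-x) \log_2(1-x)$ for $0 \le x \le 1$ denotes the binary entropy function to base 2.

Proof: By substituting $\gamma = 1$ in Theorem 3 (i.e., since there is no constraint on the conditional variance, one can take $\sigma^2 = d^2$), the corresponding exponent in (28) is equal to

\[ D \left( \frac{1+\delta}{2} \,\Big\|\, \frac{1}{2} \right) = f(\delta) \]

since $D(p \| \frac{1}{2}) = \ln 2 \, [1 - h_2(p)]$ for every $p \in [0, 1]$. ■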

Remark 9: Based on Remark 8, and since Corollary 2 is a special case of Theorem 3 when $\gamma = 1$, it follows that Corollary 2 is a tightened version of Azuma's inequality. This can be verified directly, by showing that $f(\delta) > \frac{\delta^2}{2}$ for every $\delta > 0$. This inequality is trivial for $\delta > 1$ since $f$ is by definition infinity. For $\delta \in (0, 1]$, the power series expansion of $f$ in (45) is given by

\[ f(\delta) = \sum_{p=1}^{\infty} \frac{\delta^{2p}}{2p(2p-1)} = \frac{\delta^2}{2} + \frac{\delta^4}{12} + \frac{\delta^6}{30} + \frac{\delta^8}{56} + \frac{\delta^{10}}{90} + \ldots \tag{46} \]

which indeed proves the inequality also for $\delta \in (0, 1]$. Figure 1 shows that the two exponents in (43) and (44) nearly coincide for $\delta \le 0.4$. Also, the improvement in the exponent of Corollary 2, as compared to Azuma's inequality, is by a factor of $2 \ln 2 \approx 1.386$ for $\delta = 1$.
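The comparison between the exponent $f(\delta)$ of Corollary 2 and the exponent $\delta^2/2$ of Azuma's inequality can be reproduced with a few lines of code. This is a small sketch with assumed helper names (`h2`, `f_exponent`, `f_series`), and the series is truncated at a finite, arbitrarily chosen number of terms.

```python
import math

def h2(x):
    """Binary entropy function to base 2, with h2(0) = h2(1) = 0."""
    if x in (0.0, 1.0):
        return 0.0
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

def f_exponent(delta):
    """Exponent f(delta) from (45)."""
    if delta > 1:
        return math.inf
    return math.log(2) * (1 - h2((1 - delta) / 2))

def f_series(delta, terms=200):
    """Truncated power series from (46); 'terms' is an arbitrary cutoff."""
    return sum(delta ** (2 * p) / (2 * p * (2 * p - 1))
               for p in range(1, terms + 1))

for delta in (0.1, 0.4, 0.8, 1.0):
    azuma = delta ** 2 / 2
    print(f"delta={delta}: f={f_exponent(delta):.6f}, "
          f"series={f_series(delta):.6f}, Azuma={azuma:.6f}, "
          f"ratio={f_exponent(delta) / azuma:.4f}")
```

For small $\delta$ the ratio stays close to 1, and at $\delta = 1$ it equals $2 \ln 2$.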

Discussion 1: Corollary 2 can be re-derived by the replacement of Bennett's inequality in (32) with the inequality

\[ \mathbb{E}[\exp(t \xi_k) \mid \mathcal{F}_{k-1}] \le \frac{1}{2} \left[ e^{td} + e^{-td} \right] = \cosh(td) \tag{47} \]

that holds a.s. due to the assumption that $|\xi_k| \le d$ (a.s.) for every $k$. The geometric interpretation of this inequality is based on the convexity of the exponential function, which implies that its curve is below the line segment that intersects this curve at the two endpoints of the interval $[-d, d]$. Hence,

\[ \exp(t \xi_k) \le \frac{1}{2} \left( 1 + \frac{\xi_k}{d} \right) e^{td} + \frac{1}{2} \left( 1 - \frac{\xi_k}{d} \right) e^{-td} \tag{48} \]

a.s. for every $k \in \mathbb{N}$ (or vice versa since $\mathbb{N}$ is a countable set). Since, by assumption, $\{X_k, \mathcal{F}_k\}_{k=0}^{\infty}$ is a martingale then $\mathbb{E}[\xi_k \mid \mathcal{F}_{k-1}] = 0$ a.s. for every $k \in \mathbb{N}$, so (47) indeed follows from (48). Combined with Chernoff's inequality, it yields (after making the substitution $x = td$ where $x \ge 0$) that

\[ \mathbb{P}(X_n - X_0 \ge \alpha n) \le \bigl( \exp(-\delta x) \cosh(x) \bigr)^n, \quad \forall x \ge 0. \tag{49} \]

This inequality leads to the derivation of Azuma's inequality. The difference that makes Corollary 2 a tightened version of Azuma's inequality is that in the derivation of Azuma's inequality, the hyperbolic cosine is replaced with the bound $\cosh(x) \le \exp(\frac{x^2}{2})$ so the inequality in (49) is loosened, and then the free parameter $x \ge 0$ is optimized to obtain Azuma's inequality in Theorem 1 for the special case where $d_k \triangleq d$ for every $k \in \mathbb{N}$ (note that Azuma's inequality handles the more general case where $d_k$ is not a fixed value for every $k$). In the case where $d_k \triangleq d$ for every $k$, Corollary 2 is obtained by an optimization of the non-negative parameter $x$ in (49). If $\delta \in [0, 1)$, then by setting to zero the derivative of the logarithm of the right-hand side of (49), it follows that the optimized value is equal to $x = \tanh^{-1}(\delta)$. Substituting this value into the right-hand side of (49) provides the concentration inequality in Corollary 2; to this end, one needs to rely on the identities

\[ \tanh^{-1}(\delta) = \frac{1}{2} \ln \left( \frac{1+\delta}{1-\delta} \right), \quad \cosh(x) = \bigl( 1 - \tanh^2(x) \bigr)^{-\frac{1}{2}}. \]




[Figure 1: exponents plotted versus $\delta = \alpha/d$.]

Fig. 1. Plot of the lower bounds on the exponents from Azuma's inequality in (43) and the refined inequalities in Theorem 3 and Corollary 2 (where $f$ is defined in (45)). The pointed line refers to the exponent in Corollary 2, and the three solid lines, each for a different value of $\gamma$, refer to the exponents in Theorem 3.

In the following, a known loosened version of Theorem 3 is re-derived based on Theorem 3.

Lemma 2: For every $x, y \in [0, 1]$

\[ D \left( \frac{x+y}{1+y} \,\Big\|\, \frac{y}{1+y} \right) \ge \frac{x^2}{2y} \, B \left( \frac{x}{y} \right) \tag{50} \]

where

\[ B(u) \triangleq \frac{2 \left[ (1+u) \ln(1+u) - u \right]}{u^2}, \quad \forall u > 0. \tag{51} \]

Proof: This inequality follows by calculus, and it appears in [21, Exercise 2.4.21 (a)]. ■

Corollary 3: Let $\{X_k, \mathcal{F}_k\}_{k=0}^{\infty}$ be a discrete-parameter real-valued martingale that satisfies the conditions in Theorem 3. Then, for every $\alpha \ge 0$,

\[ \mathbb{P}(|X_n - X_0| \ge \alpha n) \le 2 \exp \left( -n \gamma \left[ \left( 1 + \frac{\delta}{\gamma} \right) \ln \left( 1 + \frac{\delta}{\gamma} \right) - \frac{\delta}{\gamma} \right] \right) \tag{52} \]

where $\gamma, \delta \in [0, 1]$ are introduced in (29).

Proof: This inequality follows directly by combining inequalities (28) and (50) with the equality in (183). ■



B. Geometric Interpretation 

A common ingredient in proving Azuma's inequality and Theorem 3 is a derivation of an upper bound on the conditional expectation $\mathbb{E}[e^{t \xi_k} \mid \mathcal{F}_{k-1}]$ for $t \ge 0$ where $\mathbb{E}[\xi_k \mid \mathcal{F}_{k-1}] = 0$, $\mathrm{Var}[\xi_k \mid \mathcal{F}_{k-1}] \le \sigma^2$, and $|\xi_k| \le d$ a.s. for some $\sigma, d > 0$ and for every $k \in \mathbb{N}$. The derivation of Azuma's inequality and Corollary 2 is based on the line segment that connects the curve of the exponent $y(x) = e^{tx}$ at the endpoints of the interval $[-d, d]$; due to the convexity of $y$, this chord is above the curve of the exponential function $y$ over the interval $[-d, d]$. The derivation of Theorem 3 is based on Bennett's inequality, which is applied to the conditional expectation above. The proof of Bennett's inequality (see, e.g., [21, Lemma 2.4.1]) is briefly reviewed, while adapting it to our notation, for the continuation of this discussion. Let $X$ be a random variable with zero mean and variance $\mathbb{E}[X^2] = \sigma^2$, and assume that $X \le d$ a.s. for some $d > 0$. Let $\gamma \triangleq \frac{\sigma^2}{d^2}$. The geometric viewpoint of Bennett's inequality is based on the derivation of an upper bound on the exponential function $y$ over the interval $(-\infty, d]$; this upper bound on $y$ is a parabola that intersects $y$ at the right endpoint $(d, e^{td})$ and is tangent to the curve of $y$ at the point $(-\gamma d, e^{-t \gamma d})$. As is verified in the proof of [21, Lemma 2.4.1], it leads to the inequality $y(x) \le \varphi(x)$ for every $x \in (-\infty, d]$ where $\varphi$ is the parabola that satisfies the conditions

\[ \varphi(d) = y(d) = e^{td}, \]
\[ \varphi(-\gamma d) = y(-\gamma d) = e^{-t \gamma d}, \]
\[ \varphi'(-\gamma d) = y'(-\gamma d) = t e^{-t \gamma d}. \]

Calculation shows that this parabola admits the form

\[ \varphi(x) = \frac{(x + \gamma d) e^{td} + (d - x) e^{-t \gamma d}}{(1+\gamma) d} + \frac{\alpha \left[ \gamma d^2 + (1-\gamma) d \, x - x^2 \right]}{(1+\gamma)^2 d^2} \]

where $\alpha \triangleq \left[ (1+\gamma) t d + 1 \right] e^{-t \gamma d} - e^{td}$. At this point, since $\mathbb{E}[X] = 0$, $\mathbb{E}[X^2] = \gamma d^2$ and $X \le d$ a.s., the following bound holds:

\[
\begin{aligned}
\mathbb{E}[e^{tX}]
&\le \mathbb{E}[\varphi(X)] \\
&= \frac{\gamma e^{td} + e^{-\gamma t d}}{1+\gamma} + \alpha \left( \frac{\gamma d^2 - \mathbb{E}[X^2]}{(1+\gamma)^2 d^2} \right) \\
&= \frac{\gamma e^{td} + e^{-\gamma t d}}{1+\gamma} \\
&= \frac{\mathbb{E}[X^2] \, e^{td} + d^2 \, e^{-\frac{t \mathbb{E}[X^2]}{d}}}{d^2 + \mathbb{E}[X^2]}
\end{aligned}
\]

which indeed proves Bennett's inequality in the considered setting, and it also provides a geometric viewpoint to the proof of this inequality. Note that under the above assumption, the bound is achieved with equality when $X$ is a RV that gets the two values $+d$ and $-\gamma d$ with probabilities $\frac{\gamma}{1+\gamma}$ and $\frac{1}{1+\gamma}$, respectively. This bound also holds when $\mathbb{E}[X^2] \le \sigma^2$ since the right-hand side of the inequality is a monotonic non-decreasing function of $\mathbb{E}[X^2]$ (as was verified in the proof of [21, Lemma 2.4.1]). Applying Bennett's inequality to the conditional law of $\xi_k$ given $\mathcal{F}_{k-1}$ gives (32) (with $\gamma$ in (29)).
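The geometric picture can be probed numerically. The sketch below is an illustration under assumed values of $t$, $d$ and $\gamma$ (all hypothetical): it evaluates the parabola $\varphi$ on a finite grid inside $(-\infty, d]$, checks that it stays above $e^{tx}$ there, and confirms that the two-point law on $\{+d, -\gamma d\}$ meets the bound $\frac{\gamma e^{td} + e^{-\gamma t d}}{1+\gamma}$ with equality. A grid check is of course not a proof.

```python
import math

def parabola(x, t, d, gamma):
    """Bennett's parabola: touches e^{tx} at x = d, tangent at x = -gamma*d."""
    a = ((1 + gamma) * t * d + 1) * math.exp(-t * gamma * d) - math.exp(t * d)
    chord = ((x + gamma * d) * math.exp(t * d)
             + (d - x) * math.exp(-t * gamma * d)) / ((1 + gamma) * d)
    quad = a * (gamma * d ** 2 + (1 - gamma) * d * x - x ** 2) \
        / ((1 + gamma) ** 2 * d ** 2)
    return chord + quad

def bennett_rhs(t, d, gamma):
    """Upper bound on E[e^{tX}] obtained from the parabola."""
    return (gamma * math.exp(t * d) + math.exp(-gamma * t * d)) / (1 + gamma)

# Hypothetical values, for illustration only.
t, d, gamma = 0.7, 2.0, 0.3

# Smallest gap between the parabola and the exponential on a grid in [d-20, d].
grid = [d - 0.01 * i for i in range(2001)]
gap = min(parabola(x, t, d, gamma) - math.exp(t * x) for x in grid)

# Moment generating function of the two-point law on {+d, -gamma*d}.
mgf_two_point = (gamma / (1 + gamma)) * math.exp(t * d) \
    + (1 / (1 + gamma)) * math.exp(-t * gamma * d)

print("min gap on grid:", gap)
print("two-point MGF:  ", mgf_two_point, " bound:", bennett_rhs(t, d, gamma))
```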



C. Another Approach for the Derivation of a Refinement of Azuma's Inequality 

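Theorem 4: Let $\{X_k, \mathcal{F}_k\}_{k=0}^{\infty}$ be a discrete-parameter real-valued martingale, and let $m \in \mathbb{N}$ be an even number. Assume that the following conditions hold a.s. for every $k \in \mathbb{N}$

\[ |X_k - X_{k-1}| \le d, \]
\[ \bigl| \mathbb{E}[(X_k - X_{k-1})^l \mid \mathcal{F}_{k-1}] \bigr| \le \mu_l, \quad l = 2, \ldots, m \]

for some $d > 0$ and non-negative numbers $\{\mu_l\}_{l=2}^{m}$. Then, for every $\alpha \ge 0$,

\[ \mathbb{P}(|X_n - X_0| \ge n\alpha) \le 2 \left\{ \inf_{x \ge 0} e^{-\delta x} \left[ 1 + \sum_{l=2}^{m-1} \frac{(\gamma_l - \gamma_m) x^l}{l!} + \gamma_m (e^x - 1 - x) \right] \right\}^n \tag{53} \]

where

\[ \delta \triangleq \frac{\alpha}{d}, \quad \gamma_l \triangleq \frac{\mu_l}{d^l}, \quad \forall l = 2, \ldots, m. \tag{54} \]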

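Proof: The starting point of this proof relies on (3) and (4) that were used for the derivation of Theorem 3. From this point, we deviate from the proof of Theorem 3. For every $k \in \mathbb{N}$ and $t \ge 0$

\[
\begin{aligned}
\mathbb{E}[\exp(t \xi_k) \mid \mathcal{F}_{k-1}]
&= 1 + t \, \mathbb{E}[\xi_k \mid \mathcal{F}_{k-1}] + \ldots + \frac{t^{m-1}}{(m-1)!} \, \mathbb{E}[(\xi_k)^{m-1} \mid \mathcal{F}_{k-1}] \\
&\quad + \mathbb{E}\left[ \exp(t \xi_k) - 1 - t \xi_k - \ldots - \frac{t^{m-1} (\xi_k)^{m-1}}{(m-1)!} \,\Big|\, \mathcal{F}_{k-1} \right] \\
&= 1 + t \, \mathbb{E}[\xi_k \mid \mathcal{F}_{k-1}] + \ldots + \frac{t^{m-1}}{(m-1)!} \, \mathbb{E}[(\xi_k)^{m-1} \mid \mathcal{F}_{k-1}] \\
&\quad + \frac{t^m}{m!} \, \mathbb{E}[(\xi_k)^m \varphi_m(t \xi_k) \mid \mathcal{F}_{k-1}]
\end{aligned}
\tag{55}
\]

where

\[ \varphi_m(y) \triangleq \begin{cases} \dfrac{m!}{y^m} \left( e^y - \sum_{l=0}^{m-1} \dfrac{y^l}{l!} \right), & \text{if } y \ne 0 \\ 1, & \text{if } y = 0. \end{cases} \tag{56} \]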

In order to proceed, we need the following lemma: 

Lemma 3: For every $m \in \mathbb{N}$, the function $\varphi_m$ has the following properties:

1) $\lim_{y \to 0} \varphi_m(y) = 1$, so $\varphi_m$ is a continuous function.

2) $\varphi_m$ is a positive function over the real line.

3) $\varphi_m$ is monotonic increasing over the interval $[0, \infty)$.

4) $0 < \varphi_m(y) < 1$ for every $y < 0$.

Proof: See Appendix A. ■

Remark 10: Note that [28, Lemma 3.1] states that $\varphi_2$ is a monotonic increasing and non-negative function over the real line. In general, for $m \in \mathbb{N}$, it is easier to prove the weaker properties of $\varphi_m$ that are stated in Lemma 3; these are sufficient for the continuation of the proof of Theorem 4.

From (55) and Lemma 3, since $\xi_k \le d$ a.s., it follows that for an arbitrary $t \ge 0$

\[ \varphi_m(t \xi_k) \le \varphi_m(td), \quad \forall k \in \mathbb{N} \tag{57} \]

a.s. (to see this, let us separate the two cases where $\xi_k$ is either non-negative or negative. If $0 \le \xi_k \le d$ a.s. then, for $t \ge 0$, inequality (57) holds (a.s.) due to the monotonicity of $\varphi_m$ over $[0, \infty)$. If $\xi_k < 0$ then the third and fourth properties in Lemma 3 yield that, for $t \ge 0$ and every $k \in \mathbb{N}$,

\[ \varphi_m(t \xi_k) \le 1 = \varphi_m(0) \le \varphi_m(td), \]

so in both cases inequality (57) is satisfied). Since $m$ is even then $(\xi_k)^m \ge 0$ (note that although Lemma 3 holds in general for every $m \in \mathbb{N}$, this is the point where we need $m$ to be an even number), and

\[ \mathbb{E}[(\xi_k)^m \varphi_m(t \xi_k) \mid \mathcal{F}_{k-1}] \le \varphi_m(td) \, \mathbb{E}[(\xi_k)^m \mid \mathcal{F}_{k-1}], \quad \forall t \ge 0. \]

Also, since $\{X_k, \mathcal{F}_k\}_{k=0}^{\infty}$ is a martingale then $\mathbb{E}[\xi_k \mid \mathcal{F}_{k-1}] = 0$, and based on the assumptions of this theorem

\[ \mathbb{E}[(\xi_k)^l \mid \mathcal{F}_{k-1}] \le \mu_l = d^l \gamma_l, \quad \forall l \in \{2, \ldots, m\}. \]
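By substituting the last three results on the right-hand side of (55), it follows that for every $t \ge 0$ and every $k \in \mathbb{N}$

\[ \mathbb{E}[\exp(t \xi_k) \mid \mathcal{F}_{k-1}] \le 1 + \sum_{l=2}^{m-1} \frac{\gamma_l (td)^l}{l!} + \frac{\gamma_m (td)^m \varphi_m(td)}{m!} \tag{58} \]

so from (4)

\[ \mathbb{E}\left[ \exp\left( t \sum_{k=1}^{n} \xi_k \right) \right] \le \left( 1 + \sum_{l=2}^{m-1} \frac{\gamma_l (td)^l}{l!} + \frac{\gamma_m (td)^m \varphi_m(td)}{m!} \right)^n, \quad \forall t \ge 0. \tag{59} \]

From (3), if $\alpha \ge 0$ is arbitrary, then for every $t \ge 0$

\[ \mathbb{P}(X_n - X_0 \ge \alpha n) \le \exp(-\alpha n t) \left( 1 + \sum_{l=2}^{m-1} \frac{\gamma_l (td)^l}{l!} + \frac{\gamma_m (td)^m \varphi_m(td)}{m!} \right)^n. \]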

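Let $x \triangleq td$. Then, based on (29) and (56), for every $\alpha \ge 0$

\[
\begin{aligned}
\mathbb{P}(X_n - X_0 \ge \alpha n)
&\le \left\{ \inf_{x \ge 0} e^{-\delta x} \left[ 1 + \sum_{l=2}^{m-1} \frac{\gamma_l x^l}{l!} + \frac{\gamma_m x^m \varphi_m(x)}{m!} \right] \right\}^n \\
&= \left\{ \inf_{x \ge 0} e^{-\delta x} \left[ 1 + \sum_{l=2}^{m-1} \frac{\gamma_l x^l}{l!} + \gamma_m \left( e^x - \sum_{l=0}^{m-1} \frac{x^l}{l!} \right) \right] \right\}^n \\
&= \left\{ \inf_{x \ge 0} e^{-\delta x} \left[ 1 + \sum_{l=2}^{m-1} \frac{(\gamma_l - \gamma_m) x^l}{l!} + \gamma_m (e^x - 1 - x) \right] \right\}^n.
\end{aligned}
\tag{60}
\]

Applying inequality (60) to the martingale $\{-X_k, \mathcal{F}_k\}_{k=0}^{\infty}$ gives the same bound on the probability $\mathbb{P}(X_n - X_0 \le -\alpha n)$. Finally, the concentration inequality in (53) follows by summing the common upper bound for the probabilities of the two disjoint events $\{X_n - X_0 \ge \alpha n\}$ and $\{X_n - X_0 \le -\alpha n\}$. This completes the proof of Theorem 4. ■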
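Since the bound of Theorem 4 involves an infimum over $x \ge 0$, a numerical evaluation may help to see how it is used. The following sketch is an illustration only: the values of $\delta$ and $\{\gamma_l\}$ are hypothetical, and the infimum is approximated by a plain grid search rather than by an exact optimization. It also checks that the last two bracketed expressions in (60) agree.

```python
import math

def phi_m(y, m):
    """phi_m from (56); direct evaluation, adequate for moderate |y|."""
    if y == 0:
        return 1.0
    partial = sum(y ** l / math.factorial(l) for l in range(m))
    return math.factorial(m) * (math.exp(y) - partial) / y ** m

def base_60(x, delta, gammas):
    """Final bracketed base in (60); gammas maps l -> gamma_l, l = 2..m."""
    m = max(gammas)
    s = sum((gammas[l] - gammas[m]) * x ** l / math.factorial(l)
            for l in range(2, m))
    return math.exp(-delta * x) * (1 + s + gammas[m] * (math.exp(x) - 1 - x))

def base_via_phi(x, delta, gammas):
    """First bracketed base in (60), written with phi_m."""
    m = max(gammas)
    s = sum(gammas[l] * x ** l / math.factorial(l) for l in range(2, m))
    tail = gammas[m] * x ** m * phi_m(x, m) / math.factorial(m)
    return math.exp(-delta * x) * (1 + s + tail)

def theorem4_exponent(delta, gammas, x_max=20.0, steps=20000):
    """Grid-search approximation of -ln(inf_x base_60(x)); not exact."""
    best = min(base_60(i * x_max / steps, delta, gammas)
               for i in range(steps + 1))
    return -math.log(best)

# Hypothetical normalized moments for m = 4 and a hypothetical delta.
gammas = {2: 0.25, 3: 0.10, 4: 0.08}
delta = 0.3
print("approximate exponent of Theorem 4:", theorem4_exponent(delta, gammas))
```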

Remark 11: Without any loss of generality, it is assumed that $\alpha \in [0, d]$ (as otherwise, the considered probability is zero for $\alpha > d$). Based on the above conditions, it is also assumed that $\mu_l \le d^l$ for every $l \in \{2, \ldots, m\}$. Hence, $\delta \in [0, 1]$, and $\gamma_l \in [0, 1]$ for all values of $l$. Note that, from (29), $\gamma_2 = \gamma$.

Remark 12: From the proof of Theorem 4, it follows that the one-sided inequality (60) is satisfied if the martingale $\{X_k, \mathcal{F}_k\}_{k=0}^{\infty}$ fulfills the following conditions a.s.

\[ X_k - X_{k-1} \le d, \]
\[ \mathbb{E}[(X_k - X_{k-1})^l \mid \mathcal{F}_{k-1}] \le \mu_l, \quad l = 2, \ldots, m \]

for some positive number $d > 0$ and a sequence of arbitrary numbers $\{\mu_l\}_{l=2}^{m}$. Note that these conditions are weaker than those that are stated in Theorem 4. Under these weaker conditions, $\gamma_l \triangleq \frac{\mu_l}{d^l}$ may be larger than 1 or negative. This remark will be helpful later in this chapter.

Remark 13: The infimum in (53) of Theorem 4 is attained and thus is a minimum. To show it, let $f(x)$ for $x \in \mathbb{R}_+$ be the base of the exponent in (53), so we need to prove that $\inf_{x \in \mathbb{R}_+} f(x)$ is attained. The infimum is well defined since $f \ge 0$. Moreover, $\lim_{x \to \infty} f(x) = \infty$. Indeed

\[ f(x) = e^{-\delta x} g(x) + \gamma_m e^{(1-\delta) x}, \quad \forall x \in \mathbb{R}_+ \]

for some polynomial $g$, so for $\delta \in (0, 1)$, the first term tends to zero and the second tends to infinity as $x \to \infty$. This implies that there exists some $A > 0$ such that $f(x) \ge 1$ for every $x \ge A$. As $f(0) = 1$, one can reduce the set over which the infimum of $f$ is taken to the closed interval $[0, A]$. The claim follows from the continuity of $f$, and since every continuous function over a compact set attains its infimum.

1) Specialization of Theorem 4 for m = 2: Theorem 4 with $m = 2$ (i.e., when the same conditions as of Theorem 3 hold) is expressible in closed form, as follows:

Corollary 4: Let $\{X_k, \mathcal{F}_k\}_{k=0}^{\infty}$ be a discrete-parameter real-valued martingale that satisfies a.s. the conditions in Theorem 3. Then, for every $\alpha \ge 0$,

\[ \mathbb{P}(|X_n - X_0| \ge \alpha n) \le 2 \exp\bigl( -n \, C(\gamma, \delta) \bigr) \]

where $\gamma$ and $\delta$ are introduced in (29), and the exponent in this upper bound gets the following form:

• If $\delta > 1$ then $C(\gamma, \delta) = \infty$.

• If $\delta = 1$ then

\[ C(\gamma, \delta) = \frac{1}{\gamma} - \ln\Bigl( \gamma \bigl( e^{\frac{1}{\gamma}} - 1 \bigr) \Bigr). \]

• Otherwise, if $\delta \in (0, 1)$, then

\[ C(\gamma, \delta) = \delta x - \ln\bigl( 1 + \gamma (e^x - 1 - x) \bigr) \]

where $x \in (0, \frac{1}{\gamma})$ is given by

\[ x = \frac{1}{\gamma} + \frac{1}{\delta} - 1 - W_0 \left( \frac{(1-\delta) \, e^{\frac{1}{\gamma} + \frac{1}{\delta} - 1}}{\delta} \right) \tag{61} \]

and $W_0$ denotes the principal branch of the Lambert W function [17].

Proof: See Appendix B. ■
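The closed form of Corollary 4 can be evaluated without special libraries. In the sketch below, `lambert_w0` is a hand-written Newton iteration for the principal branch on non-negative arguments (an assumption of this example, not a routine from the chapter), and the values of $\gamma$ and $\delta$ are hypothetical. The code evaluates the parameter $x$ from (61) and the exponent $C(\gamma, \delta)$, and verifies the stationarity condition $\gamma(e^x - 1) = \delta\,[1 + \gamma(e^x - 1 - x)]$ that characterizes the minimizer for $m = 2$.

```python
import math

def lambert_w0(z, tol=1e-14, max_iter=100):
    """Principal branch W0(z) for z >= 0 via Newton's method on w*e^w = z."""
    w = math.log1p(z)  # starting point; W0(z) <= ln(1+z) for z >= 0
    for _ in range(max_iter):
        ew = math.exp(w)
        step = (w * ew - z) / (ew * (w + 1))
        w -= step
        if abs(step) < tol * (1 + abs(w)):
            break
    return w

def optimal_x(gamma, delta):
    """Value of x from (61); requires 0 < delta < 1."""
    s = 1 / gamma + 1 / delta - 1
    return s - lambert_w0((1 - delta) / delta * math.exp(s))

def exponent_c(gamma, delta):
    """Exponent C(gamma, delta) of Corollary 4 for delta > 0."""
    if delta > 1:
        return math.inf
    if delta == 1:
        return 1 / gamma - math.log(gamma * math.expm1(1 / gamma))
    x = optimal_x(gamma, delta)
    return delta * x - math.log(1 + gamma * (math.exp(x) - 1 - x))

# Hypothetical values of gamma and delta.
g, dl = 0.25, 0.4
x = optimal_x(g, dl)
print("x from (61):", x, " C(gamma, delta):", exponent_c(g, dl))
```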
Proposition 1: If $\gamma < \frac{1}{2}$ then Corollary 4 gives a stronger result than Corollary 2 (and, hence, it is also better than Azuma's inequality).

Proof: See Appendix C. ■ 
It is of interest to compare the tightness of Theorem 3 and Corollary 4. This leads to the following conclusion: 
Proposition 2: The concentration inequality in Corollary 4 is looser than Theorem 3. 

Proof: See Appendix D. ■ 

2) Exploring the Dependence of the Bound in Theorem 4 in Terms of m: In the previous sub-section, a closed-form expression of Theorem 4 was obtained for the special case where $m = 2$ (see Corollary 4), but Proposition 2 states that this special case is looser than Theorem 3 (which is also given in closed form). Hence, it is natural to enquire how the bound in Theorem 4 varies in terms of $m$ (where $m \ge 2$ is even), and whether there is a chance to obtain an improvement over Theorem 3 by assigning some even values of $m > 2$ in Theorem 4. Also, due to the closed-form expression in Corollary 4, it would be pleasing to derive from Theorem 4 an inequality that is expressed in closed form for a general even value of $m \ge 2$. The continuation of the study in this sub-section is outlined as follows:

• A loosened version of Theorem 4 is introduced, and it is shown to provide an inequality whose tightness consistently improves by increasing the value of $m$. For $m = 2$, this loosened version coincides with Theorem 4. Hence, it follows (by introducing this loosened version) that $m = 2$ provides the weakest bound in Theorem 4.

• Inspired by the closed-form expression of the bound in Corollary 4, we derive a closed-form inequality (i.e., a bound that is not subject to numerical optimization) by either loosening Theorem 4 or further loosening its looser version from the previous item. As will be exemplified numerically in Section VI, the closed-form expression of the new bound causes a marginal loosening of Theorem 4. Also, for $m = 2$, it is exactly Theorem 4.

• A necessary and sufficient condition is derived for the case where, for an even $m \ge 4$, Theorem 4 provides a bound that is exponentially advantageous over Theorem 3. Note however that, when $m \ge 4$ in Theorem 4, one needs to calculate conditional moments of the martingale differences that are of higher orders than 2; hence, an improvement in Theorem 4 is obtained at the expense of the need to calculate higher-order conditional moments. Having said this, note that the derivation of Theorem 4 deviates from the proof of Theorem 3 at an early stage, and it cannot be considered as a generalization of Theorem 3 when higher-order moments are available (as is also evidenced in Proposition 2, which demonstrates that, for $m = 2$, Theorem 4 is weaker than Theorem 3).

• Finally, this sufficient condition is particularized in the asymptotic case where $m \to \infty$. It is of interest since the tightness of the loosened version of Theorem 4 from the first item is improved by increasing the value of $m$.

The analysis that is related to the above outline is presented in the following. Numerical results that are related to the comparison of Theorems 3 and 4 are relegated to Section VI (where they are considered in a certain communication-theoretic context).

Corollary 5: Let $\{X_k, \mathcal{F}_k\}_{k=0}^{\infty}$ be a discrete-parameter real-valued martingale, and let $m \in \mathbb{N}$ be an even number. Assume that $|X_k - X_{k-1}| \le d$ holds a.s. for every $k \in \mathbb{N}$, and that there exists a (non-negative) sequence $\{\mu_l\}_{l=2}^{m}$ so that for every $k \in \mathbb{N}$

\[ \mu_l = \mathbb{E}\bigl[ |X_k - X_{k-1}|^l \mid \mathcal{F}_{k-1} \bigr], \quad \forall l = 2, \ldots, m. \tag{62} \]

Then, inequality (53) holds with the notation in (54).

Proof: This corollary is a consequence of Theorem 4 since

\[ \bigl| \mathbb{E}[(X_k - X_{k-1})^l \mid \mathcal{F}_{k-1}] \bigr| \le \mathbb{E}\bigl[ |X_k - X_{k-1}|^l \mid \mathcal{F}_{k-1} \bigr]. \]

■

Proposition 3: Theorem 4 and Corollary 5 coincide for $m = 2$ (hence, Corollary 5 provides in this case the result stated in Corollary 4). Furthermore, the bound in Corollary 5 improves as the even value of $m \in \mathbb{N}$ is increased.

Proof: The proof is very technical, and it is omitted for the sake of brevity. ■

Inspired by the closed-form inequality that follows from Theorem 4 for $m = 2$ (see Corollary 4), a closed-form inequality is suggested in the following by loosening either Theorem 4 or Corollary 5. It generalizes the result in Corollary 4, and it coincides with Theorem 4 and Corollary 5 for $m = 2$.

Corollary 6: Under the conditions of Corollary 5 then, for every $\alpha \ge 0$,

\[ \mathbb{P}(X_n - X_0 \ge n\alpha) \le \left\{ e^{-\delta x} \left[ 1 + \sum_{l=2}^{m-1} \frac{(\gamma_l - \gamma_m) x^l}{l!} + \gamma_m (e^x - 1 - x) \right] \right\}^n \tag{63} \]

where $\{\gamma_l\}_{l=2}^{m}$ and $\delta$ are introduced in (54),

\[ x = \frac{a+b}{c} - W_0 \left( \frac{b}{c} \cdot e^{\frac{a+b}{c}} \right) \tag{64} \]

with $W_0$ that denotes the principal branch of the Lambert W function [17], and

\[ a \triangleq \frac{1}{\gamma_2}, \quad b \triangleq \frac{\gamma_m}{\gamma_2} \left( \frac{1}{\delta} - 1 \right), \quad c \triangleq \frac{1}{\delta} - b. \tag{65} \]

Proof: See Appendix E. ■

Remark 14: It is exemplified numerically in Section VI that the replacement of the infimum over $x \ge 0$ on the right-hand side of (53) with the sub-optimal choice of the value of $x$ that is given in (64) and (65) implies a marginal loosening in the exponent of the bound. Note also that, for $m = 2$, this value of $x$ is optimal since it coincides with the exact value in (61).

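Corollary 7: Under the assumptions of Theorem 3 then, for every $\alpha \ge 0$,

\[ \mathbb{P}(X_n - X_0 \ge n\alpha) \le e^{-nE} \tag{66} \]

where

\[ E = E_2(\gamma_2, \delta) \triangleq D \left( \frac{\delta + \gamma_2}{1 + \gamma_2} \,\Big\|\, \frac{\gamma_2}{1 + \gamma_2} \right). \tag{67} \]

Also, under the assumptions of Theorem 4 or Corollary 5, (66) holds for every $\alpha \ge 0$ with

\[ E = E_4\bigl( \{\gamma_l\}_{l=2}^{m}, \delta \bigr) \triangleq \sup_{x \ge 0} \left\{ \delta x - \ln \left( 1 + \sum_{l=2}^{m-1} \frac{(\gamma_l - \gamma_m) x^l}{l!} + \gamma_m (e^x - 1 - x) \right) \right\} \tag{68} \]

where $m \ge 2$ is an arbitrary even number. Hence, Theorem 4 or Corollary 5 are better exponentially than Theorem 3 if and only if $E_4 > E_2$.

Proof: The proof follows directly from (37) and (60). ■

Remark 15: In order to avoid the operation of taking the supremum over $x \in [0, \infty)$, it is sufficient to first check whether $\widetilde{E}_4 > E_2$, where

\[ \widetilde{E}_4 \triangleq \delta x - \ln \left( 1 + \sum_{l=2}^{m-1} \frac{(\gamma_l - \gamma_m) x^l}{l!} + \gamma_m (e^x - 1 - x) \right) \]

is the objective in (68) evaluated at the value of $x$ in (64) and (65), so that $\widetilde{E}_4 \le E_4$. This sufficient condition is exemplified later in Section VI.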



D. Concentration Inequalities for Small Deviations 

In the following, we consider the probability of the events $\{|X_n - X_0| \ge \alpha \sqrt{n}\}$ for an arbitrary $\alpha \ge 0$. These events correspond to small deviations. This is in contrast to events of the form $\{|X_n - X_0| \ge \alpha n\}$, whose probabilities were analyzed earlier in this section, referring to large deviations.

Proposition 4: Let $\{X_k, \mathcal{F}_k\}_{k=0}^{\infty}$ be a discrete-parameter real-valued martingale. Then, Theorem 3, and also Corollaries 3 and 4 imply that, for every $\alpha \ge 0$,

\[ \mathbb{P}(|X_n - X_0| \ge \alpha \sqrt{n}) \le 2 \exp \left( -\frac{\delta^2}{2\gamma} \right) \left( 1 + O\bigl( n^{-\frac{1}{2}} \bigr) \right). \tag{69} \]

Also, under the conditions of Theorem 4, inequality (69) holds for every even $m \ge 2$ (so the conditional moments of higher order than 2 do not improve, via Theorem 4, the scaling of the upper bound in (69)).

Proof: See Appendix F. ■

Remark 16: From Proposition 4, all the upper bounds on $\mathbb{P}(|X_n - X_0| \ge \alpha \sqrt{n})$ (for an arbitrary $\alpha \ge 0$) improve the exponent of Azuma's inequality by a factor of $\frac{1}{\gamma}$.



E. Inequalities for Sub and Super Martingales 

Upper bounds on the probability $\mathbb{P}(X_n - X_0 \ge r)$ for $r > 0$, earlier derived in this section for martingales, can be adapted to super-martingales (similarly to, e.g., [15, Chapter 2] or [16, Section 2.7]). Alternatively, replacing $\{X_k, \mathcal{F}_k\}_{k=0}^{n}$ with $\{-X_k, \mathcal{F}_k\}_{k=0}^{n}$ provides upper bounds on the probability $\mathbb{P}(X_n - X_0 \le -r)$ for sub-martingales. For example, the adaptation of Theorem 3 to sub and super martingales gives the following theorem:

Theorem 5: Let $\{X_k, \mathcal{F}_k\}_{k=0}^{\infty}$ be a discrete-parameter real-valued super-martingale. Assume that, for some constants $d, \sigma > 0$, the following two requirements are satisfied a.s.

\[ X_k - \mathbb{E}[X_k \mid \mathcal{F}_{k-1}] \le d, \]
\[ \mathrm{Var}(X_k \mid \mathcal{F}_{k-1}) \triangleq \mathbb{E}\Bigl[ \bigl( X_k - \mathbb{E}[X_k \mid \mathcal{F}_{k-1}] \bigr)^2 \,\Big|\, \mathcal{F}_{k-1} \Bigr] \le \sigma^2 \]

for every $k \in \{1, \ldots, n\}$. Then, for every $\alpha \ge 0$,

\[ \mathbb{P}(X_n - X_0 \ge \alpha n) \le \exp \left( -n \, D \left( \frac{\delta+\gamma}{1+\gamma} \,\Big\|\, \frac{\gamma}{1+\gamma} \right) \right) \tag{70} \]

where $\gamma$ and $\delta$ are defined as in (29), and the divergence $D(p \| q)$ is introduced in (30). Alternatively, if $\{X_k, \mathcal{F}_k\}_{k=0}^{\infty}$ is a sub-martingale, the same upper bound in (70) holds for the probability $\mathbb{P}(X_n - X_0 \le -\alpha n)$. If $\delta > 1$, then these two probabilities are equal to zero.

Proof: The proof of this theorem is similar to the proof of Theorem 3. The only difference is that for a super-martingale, due to its basic property in Section II-B,

\[ X_n - X_0 = \sum_{k=1}^{n} (X_k - X_{k-1}) \le \sum_{k=1}^{n} \xi_k \]

a.s., where $\xi_k \triangleq X_k - \mathbb{E}[X_k \mid \mathcal{F}_{k-1}]$ is $\mathcal{F}_k$-measurable. Hence $\mathbb{P}(X_n - X_0 \ge \alpha n) \le \mathbb{P}\bigl( \sum_{k=1}^{n} \xi_k \ge \alpha n \bigr)$ where a.s. $\xi_k \le d$, $\mathbb{E}[\xi_k \mid \mathcal{F}_{k-1}] = 0$, and $\mathrm{Var}(\xi_k \mid \mathcal{F}_{k-1}) \le \sigma^2$. The continuation of the proof coincides with the proof of Theorem 3 (starting from (3)). The other inequality for sub-martingales holds due to the fact that if $\{X_k, \mathcal{F}_k\}$ is a sub-martingale then $\{-X_k, \mathcal{F}_k\}$ is a super-martingale. ■



V. Relations of the Refined Inequalities to Some Classical Results in Probability Theory 
A. Relation between the Martingale Central Limit Theorem (CLT) and Proposition 4

In this subsection, we discuss the relation between the martingale CLT and the concentration inequalities for discrete-parameter martingales in Proposition 4.

Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space. Given a filtration $\{\mathcal{F}_k\}$, then $\{Y_k, \mathcal{F}_k\}_{k=0}^{\infty}$ is said to be a martingale-difference sequence if, for every $k$,

1) $Y_k$ is $\mathcal{F}_k$-measurable,

2) $\mathbb{E}[|Y_k|] < \infty$,

3) $\mathbb{E}[Y_k \mid \mathcal{F}_{k-1}] = 0$.

Let

\[ S_n = \sum_{k=1}^{n} Y_k, \quad \forall n \in \mathbb{N} \]

and $S_0 = 0$, then $\{S_k, \mathcal{F}_k\}_{k=0}^{\infty}$ is a martingale. Assume that the sequence of RVs $\{Y_k\}$ is bounded, i.e., there exists a constant $d$ such that $|Y_k| \le d$ a.s., and furthermore, assume that the limit

\[ \sigma^2 \triangleq \lim_{n \to \infty} \frac{1}{n} \sum_{k=1}^{n} \mathbb{E}[Y_k^2 \mid \mathcal{F}_{k-1}] \]

exists in probability and is positive. The martingale CLT asserts that, under the above conditions, $\frac{S_n}{\sqrt{n}}$ converges in distribution (i.e., weakly converges) to the Gaussian distribution $\mathcal{N}(0, \sigma^2)$. It is denoted by $\frac{S_n}{\sqrt{n}} \Rightarrow \mathcal{N}(0, \sigma^2)$. We note that there exist more general versions of this statement (see, e.g., [11, pp. 475-478]).

Let $\{X_k, \mathcal{F}_k\}_{k=0}^{\infty}$ be a discrete-parameter real-valued martingale with bounded jumps, and assume that there exists a constant $d$ so that a.s.

\[ |X_k - X_{k-1}| \le d, \quad \forall k \in \mathbb{N}. \]

Define, for every $k \in \mathbb{N}$,

\[ Y_k \triangleq X_k - X_{k-1} \]

and $Y_0 \triangleq 0$, so $\{Y_k, \mathcal{F}_k\}_{k=0}^{\infty}$ is a martingale-difference sequence, and $|Y_k| \le d$ a.s. for every $k \in \mathbb{N} \cup \{0\}$. Furthermore, for every $n \in \mathbb{N}$,

\[ S_n \triangleq \sum_{k=1}^{n} Y_k = X_n - X_0. \]

Under the assumptions in Theorem 3 and its subsequent corollaries, for every $k \in \mathbb{N}$, one gets a.s. that

\[ \mathbb{E}[Y_k^2 \mid \mathcal{F}_{k-1}] = \mathbb{E}[(X_k - X_{k-1})^2 \mid \mathcal{F}_{k-1}] \le \sigma^2. \]

Let us assume that this inequality holds a.s. with equality. It follows from the martingale CLT that

\[ \frac{X_n - X_0}{\sqrt{n}} \Rightarrow \mathcal{N}(0, \sigma^2) \]

and therefore, for every $\alpha \ge 0$,

\[ \lim_{n \to \infty} \mathbb{P}(|X_n - X_0| \ge \alpha \sqrt{n}) = 2 \, Q\left( \frac{\alpha}{\sigma} \right) \]

where the Q function is introduced in (218).

Based on the notation in (29), the equality $\frac{\alpha}{\sigma} = \frac{\delta}{\sqrt{\gamma}}$ holds, and

\[ \lim_{n \to \infty} \mathbb{P}(|X_n - X_0| \ge \alpha \sqrt{n}) = 2 \, Q\left( \frac{\delta}{\sqrt{\gamma}} \right). \tag{71} \]

Since, for every $x \ge 0$,

\[ Q(x) \le \frac{1}{2} \exp\left( -\frac{x^2}{2} \right) \]

then it follows that for every $\alpha \ge 0$

\[ \lim_{n \to \infty} \mathbb{P}(|X_n - X_0| \ge \alpha \sqrt{n}) \le \exp\left( -\frac{\delta^2}{2\gamma} \right). \]

This inequality coincides with the asymptotic result of the inequalities in Proposition 4 (see (69) in the limit where $n \to \infty$), except for the additional factor of 2. Note also that the proof of the concentration inequalities in Proposition 4 (see Appendix F) provides inequalities that are informative for finite $n$, and not only in the asymptotic case where $n$ tends to infinity. Furthermore, due to the exponential upper and lower bounds of the Q-function in (15), it follows from (71) that the exponent in the concentration inequality (69) (i.e., $\frac{\delta^2}{2\gamma}$) cannot be improved under the above assumptions (unless some more information is available).
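The relation between the CLT limit in (71) and the leading term of (69) can be seen numerically. The sketch below is illustrative: it writes the Q function through the complementary error function of the standard library, $Q(x) = \frac{1}{2}\,\mathrm{erfc}(x/\sqrt{2})$, and uses hypothetical values of $\delta$ and $\gamma$.

```python
import math

def q_function(x):
    """Gaussian tail probability Q(x), via the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2))

def clt_limit(delta, gamma):
    """Right-hand side of (71): 2*Q(delta/sqrt(gamma))."""
    return 2 * q_function(delta / math.sqrt(gamma))

def prop4_leading_term(delta, gamma):
    """Leading term of the bound in (69): 2*exp(-delta^2/(2*gamma))."""
    return 2 * math.exp(-delta ** 2 / (2 * gamma))

# Hypothetical (delta, gamma) pairs.
pairs = ((0.2, 0.25), (0.5, 0.25), (1.0, 0.25), (4.0, 0.25))
for delta, gamma in pairs:
    rate = delta ** 2 / (2 * gamma)
    print(f"delta={delta}, gamma={gamma}: CLT limit={clt_limit(delta, gamma):.3e}, "
          f"leading term of (69)={prop4_leading_term(delta, gamma):.3e}, "
          f"-ln(limit)/rate={-math.log(clt_limit(delta, gamma)) / rate:.4f}")
```

As $\delta/\sqrt{\gamma}$ grows, the ratio in the last column approaches 1, in line with the statement that the exponent $\frac{\delta^2}{2\gamma}$ cannot be improved.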

B. Relation between the Law of the Iterated Logarithm (LIL) and Theorem 3 

In this subsection, we discuss the relation between the law of the iterated logarithm (LIL) and Theorem 3.

According to the law of the iterated logarithm (see, e.g., [11, Theorem 9.5]) if $\{X_k\}_{k=1}^{\infty}$ are i.i.d. real-valued RVs with zero mean and unit variance, and $S_n \triangleq \sum_{i=1}^{n} X_i$ for every $n \in \mathbb{N}$, then

\[ \limsup_{n \to \infty} \frac{S_n}{\sqrt{2n \ln \ln n}} = 1 \quad \text{a.s.} \tag{72} \]

and

\[ \liminf_{n \to \infty} \frac{S_n}{\sqrt{2n \ln \ln n}} = -1 \quad \text{a.s.} \tag{73} \]

Eqs. (72) and (73) assert, respectively, that for every $\varepsilon > 0$, along almost any realization,

\[ S_n > (1 - \varepsilon) \sqrt{2n \ln \ln n} \]

and

\[ S_n < -(1 - \varepsilon) \sqrt{2n \ln \ln n} \]

are satisfied infinitely often (i.o.). On the other hand, Eqs. (72) and (73) imply that along almost any realization, each of the two inequalities

\[ S_n > (1 + \varepsilon) \sqrt{2n \ln \ln n} \]

and

\[ S_n < -(1 + \varepsilon) \sqrt{2n \ln \ln n} \]

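is satisfied for a finite number of values of $n$.

Let $\{X_k\}_{k=1}^{\infty}$ be i.i.d. real-valued RVs, defined over the probability space $(\Omega, \mathcal{F}, \mathbb{P})$, with $\mathbb{E}[X_1] = 0$ and $\mathbb{E}[X_1^2] = 1$.

Let us define the natural filtration where $\mathcal{F}_0 = \{\emptyset, \Omega\}$, and $\mathcal{F}_k = \sigma(X_1, \ldots, X_k)$ is the $\sigma$-algebra that is generated by the RVs $X_1, \ldots, X_k$ for every $k \in \mathbb{N}$. Let $S_0 = 0$ and $S_n$ be defined as above for every $n \in \mathbb{N}$. It is straightforward to verify by Definition 1 that $\{S_n, \mathcal{F}_n\}_{n=0}^{\infty}$ is a martingale.

In order to apply Theorem 3 to the considered case, let us assume that the RVs $\{X_k\}_{k=1}^{\infty}$ are uniformly bounded, i.e., it is assumed that there exists a constant $c$ such that $|X_k| \le c$ a.s. for every $k \in \mathbb{N}$. Since $\mathbb{E}[X_1^2] = 1$ then $c \ge 1$. This assumption implies that the martingale $\{S_n, \mathcal{F}_n\}_{n=0}^{\infty}$ has bounded jumps, and for every $n \in \mathbb{N}$

\[ |S_n - S_{n-1}| \le c \quad \text{a.s.} \]

Moreover, due to the independence of the RVs $\{X_k\}_{k=1}^{\infty}$,

\[ \mathrm{Var}(S_n \mid \mathcal{F}_{n-1}) = \mathbb{E}(X_n^2 \mid \mathcal{F}_{n-1}) = \mathbb{E}(X_n^2) = 1 \quad \text{a.s.} \]

From Theorem 3, it follows that for every $\alpha \ge 0$

\[ \mathbb{P}\bigl( S_n \ge \alpha \sqrt{2n \ln \ln n} \bigr) \le \exp\left( -n \, D\left( \frac{\delta_n + \gamma}{1 + \gamma} \,\Big\|\, \frac{\gamma}{1 + \gamma} \right) \right) \tag{74} \]

where

\[ \delta_n \triangleq \frac{\alpha}{c} \sqrt{\frac{2 \ln \ln n}{n}}, \quad \gamma \triangleq \frac{1}{c^2}. \tag{75} \]

Straightforward calculation shows that

\[
\begin{aligned}
n \, D\left( \frac{\delta_n + \gamma}{1 + \gamma} \,\Big\|\, \frac{\gamma}{1 + \gamma} \right)
&= \frac{n\gamma}{1+\gamma} \left[ \left( 1 + \frac{\delta_n}{\gamma} \right) \ln\left( 1 + \frac{\delta_n}{\gamma} \right) + \frac{1}{\gamma} (1 - \delta_n) \ln(1 - \delta_n) \right] \\
&\stackrel{(a)}{=} \frac{n\gamma}{1+\gamma} \left[ \frac{\delta_n^2}{2} \left( \frac{1}{\gamma^2} + \frac{1}{\gamma} \right) + \frac{\delta_n^3}{6} \left( \frac{1}{\gamma} - \frac{1}{\gamma^3} \right) + \ldots \right] \\
&= \frac{n \delta_n^2}{2\gamma} - \frac{n \delta_n^3 (1 - \gamma)}{6 \gamma^2} + \ldots \\
&\stackrel{(b)}{=} \alpha^2 \ln \ln n \left[ 1 - \frac{\alpha (c^2 - 1)}{3c} \sqrt{\frac{2 \ln \ln n}{n}} + \ldots \right]
\end{aligned}
\tag{76}
\]

where equality (a) follows from the power series expansion

\[ (1+u) \ln(1+u) = u + \sum_{k=2}^{\infty} \frac{(-u)^k}{k(k-1)}, \quad -1 < u \le 1 \]

and equality (b) follows from (75). A substitution of (76) into (74) gives that, for every $\alpha \ge 0$,

\[ \mathbb{P}\bigl( S_n \ge \alpha \sqrt{2n \ln \ln n} \bigr) \le (\ln n)^{-\alpha^2 \left[ 1 + O\left( \sqrt{\frac{\ln \ln n}{n}} \right) \right]} \tag{77} \]

and the same bound also applies to $\mathbb{P}\bigl( S_n \le -\alpha \sqrt{2n \ln \ln n} \bigr)$ for $\alpha \ge 0$. This provides complementary information to the limits in (72) and (73) that are provided by the LIL. From Remark 7, which follows from Doob's maximal inequality for sub-martingales, the inequality in (77) can be strengthened to

\[ \mathbb{P}\Bigl( \max_{1 \le k \le n} S_k \ge \alpha \sqrt{2n \ln \ln n} \Bigr) \le (\ln n)^{-\alpha^2 \left[ 1 + O\left( \sqrt{\frac{\ln \ln n}{n}} \right) \right]}. \tag{78} \]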
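The expansion in (76) can be sanity-checked numerically. The sketch below is only an illustration under hypothetical values of $n$, $\alpha$ and $c$: it compares the exact exponent $n D(\cdot \| \cdot)$ from (74), with $\delta_n$ and $\gamma$ as in (75), against the leading term $\alpha^2 \ln \ln n$ and against the two-term approximation obtained from the quadratic and cubic terms of the series.

```python
import math

def kl_bernoulli(p, q):
    """Binary divergence D(p||q) in nats, for 0 < p, q < 1."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def exact_exponent(n, alpha, c):
    """n * D(...) from (74), with delta_n and gamma as in (75)."""
    gamma = 1 / c ** 2
    delta_n = (alpha / c) * math.sqrt(2 * math.log(math.log(n)) / n)
    return n * kl_bernoulli((delta_n + gamma) / (1 + gamma),
                            gamma / (1 + gamma))

def leading_term(n, alpha):
    """First-order term alpha^2 * ln ln n."""
    return alpha ** 2 * math.log(math.log(n))

def two_term(n, alpha, c):
    """Quadratic plus cubic terms of the series, written as in (76)."""
    lln = math.log(math.log(n))
    correction = alpha * (c ** 2 - 1) / (3 * c) * math.sqrt(2 * lln / n)
    return alpha ** 2 * lln * (1 - correction)

# Hypothetical values.
n, alpha, c = 10 ** 6, 1.0, 2.0
print("exact:       ", exact_exponent(n, alpha, c))
print("leading term:", leading_term(n, alpha))
print("two terms:   ", two_term(n, alpha, c))
```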
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It is shown in the following that (78) and the first Borel-Cantelli lemma can serve to prove one part of (72). Using 
this approach, it is shown that if a > 1, then the probability that S n > a\j2n In Inn i.o. is zero. To this end, let 
6 > 1 be set arbitrarily, and define 

A n = \J [S k > aV2Hnm£:} 

k-.e n - i <k<e n 

for every neN. Hence, the union of these sets is 

A±\jA n =(j{s k > aV2kln\nk} 

neN ken 

The following inequalities hold (since 9 > 1): 



P(Ai) < P jmax ^ S k > aV26' n - 1 lnln(6' n - 1 )^ 
= P ( max S k > ^"lnln^™" 1 ) ) 
< P ^max S k > -j= ^26 n lnln(6»™- 1 )^ 



< (nln#r^( 1+Al) (79) 
where the last inequality follows from (78) with (3 n — > as n — > oo. Since 

oo 

n r < oo, Va > V# 

n=l 

then it follows from the first Borel-Cantelli lemma that P(^4 i.o.) = for all a > \fd. But the event A does not 
depend on 0, and > 1 can be made arbitrarily close to 1. This asserts that F(A i.o.) = for every a > 1, or 
equivalently 

lim sup n = < 1 a.s. 
n->oo y2nlnlnn 

Similarly, by replacing {Aj} with {— Aj}, it follows that 

lim inf = — > —1 a.S. 

n-s>oo -^n In Inn 

Theorem 3 therefore gives inequality (78), and it implies one side in each of the two equalities for the LIL in (72) 
and (73). 
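The expansion in (76) can be checked numerically. The following is a minimal sketch, assuming that $c$ denotes the uniform bound on $|X_k|$ that enters (75); the function names `kl_bernoulli` and `exponent_terms` are illustrative and do not come from the chapter. It evaluates the exact exponent $n D(\cdot\|\cdot)$ from (74) next to the first two terms of the series, $\frac{n\delta_n^2}{2\gamma}$ and $\frac{n\delta_n^3(1-\gamma)}{6\gamma^2}$.

```python
import math

def kl_bernoulli(p, q):
    """Binary divergence D(p||q) in nats, for 0 < p, q < 1."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def exponent_terms(n, alpha, c):
    """Exact exponent n*D(...) of (74) and the first two series terms of (76)."""
    delta = (alpha / c) * math.sqrt(2 * math.log(math.log(n)) / n)  # delta_n in (75)
    gamma = 1.0 / c ** 2                                            # gamma in (75)
    exact = n * kl_bernoulli((delta + gamma) / (1 + gamma), gamma / (1 + gamma))
    leading = n * delta ** 2 / (2 * gamma)           # equals alpha^2 * ln ln n
    cubic = n * delta ** 3 * (1 - gamma) / (6 * gamma ** 2)
    return exact, leading, cubic

# Illustrative values (assumptions, not taken from the chapter).
exact, leading, cubic = exponent_terms(n=10**6, alpha=1.0, c=2.0)
print(exact, leading - cubic)
```

For these values the leading term coincides with $\alpha^2\ln\ln n$, and the remaining gap between the exact exponent and the two-term approximation is of the order of the neglected fourth-order term.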

C. Relation of Theorems 3 and 4 with the Moderate Deviations Principle 

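According to the moderate deviations theorem (see, e.g., [21, Theorem 3.7.1]) in $\mathbb{R}$, let $\{X_i\}_{i=1}^{n}$ be a sequence of real-valued i.i.d. RVs such that $\Lambda_X(\lambda) = E[e^{\lambda X_1}] < \infty$ in some neighborhood of zero, and also assume that $E[X_i] = 0$ and $\sigma^2 = \mathrm{Var}(X_i) > 0$. Let $\{a_n\}_{n=1}^{\infty}$ be a non-negative sequence such that $a_n \to 0$ and $n a_n \to \infty$ as $n \to \infty$, and let

$$Z_n \triangleq \sqrt{\frac{a_n}{n}}\,\sum_{i=1}^{n} X_i, \qquad \forall\, n \in \mathbb{N}. \tag{80}$$

Then, for every measurable set $\Gamma \subseteq \mathbb{R}$,

$$\begin{aligned}
-\frac{1}{2\sigma^2}\inf_{x\in\Gamma^{\circ}} x^2
&\le \liminf_{n\to\infty}\, a_n \ln P(Z_n \in \Gamma) \\
&\le \limsup_{n\to\infty}\, a_n \ln P(Z_n \in \Gamma) \\
&\le -\frac{1}{2\sigma^2}\inf_{x\in\overline{\Gamma}} x^2
\end{aligned} \tag{81}$$

where $\Gamma^{\circ}$ and $\overline{\Gamma}$ designate, respectively, the interior and closure sets of $\Gamma$.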

Let $\eta \in (\tfrac{1}{2}, 1)$ be an arbitrary fixed number, and let $\{a_n\}_{n=1}^{\infty}$ be the non-negative sequence

$$a_n = n^{1-2\eta}, \qquad \forall\, n \in \mathbb{N}$$

so that $a_n \to 0$ and $n a_n \to \infty$ as $n \to \infty$. Let $\alpha \in \mathbb{R}^{+}$, and $\Gamma \triangleq (-\infty, -\alpha] \cup [\alpha, \infty)$. Note that, from (80),

$$P\left(\Bigl|\sum_{i=1}^{n} X_i\Bigr| \ge \alpha n^{\eta}\right) = P(Z_n \in \Gamma)$$

so from the moderate deviations principle (MDP), for every $\alpha \ge 0$,

$$\lim_{n\to\infty} n^{1-2\eta}\,\ln P\left(\Bigl|\sum_{i=1}^{n} X_i\Bigr| \ge \alpha n^{\eta}\right) = -\frac{\alpha^2}{2\sigma^2}. \tag{82}$$

It is demonstrated in Appendix G that, in contrast to Azuma's inequality, Theorems 3 and 4 (for every even $m \ge 2$ in Theorem 4) provide upper bounds on the probability

$$P\left(\Bigl|\sum_{i=1}^{n} X_i\Bigr| \ge \alpha n^{\eta}\right), \qquad \forall\, n \in \mathbb{N},\ \alpha \ge 0$$

which both coincide with the asymptotic limit in (82). The analysis in Appendix G provides another interesting link between Theorems 3 and 4 and a classical result in probability theory, which also emphasizes the significance of the refinements of Azuma's inequality.
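To get a feeling for the scaling in (82), the following sketch evaluates the left-hand side exactly for a sum of i.i.d. fair $\pm 1$ RVs (so $\sigma^2 = 1$), using the binomial distribution. The choice of distribution, the function name `scaled_log_tail` and the parameter values are assumptions made only for illustration; convergence to $-\frac{\alpha^2}{2\sigma^2}$ is slow, so a finite $n$ only approximates the limit.

```python
import math

def scaled_log_tail(n, alpha, eta):
    """n^(1-2*eta) * ln P(|X_1+...+X_n| >= alpha*n^eta) for i.i.d. fair +/-1 RVs."""
    threshold = alpha * n ** eta
    # The sum equals 2k - n, where k is the number of +1 outcomes.
    count = sum(math.comb(n, k) for k in range(n + 1) if abs(2 * k - n) >= threshold)
    log_prob = math.log(count) - n * math.log(2)  # exact tail probability, in nats
    return n ** (1 - 2 * eta) * log_prob

value = scaled_log_tail(n=2000, alpha=1.0, eta=0.75)
print(value)  # to be compared with the limit -alpha^2 / (2 * sigma^2) = -0.5
```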

D. Relation of [50, Lemma 2.8] with Theorem 4 & Corollary 4 

In [50, Lemma 2.8], it is proved that if $X$ is a random variable that satisfies $E[X] = 0$ and $X \le d$ a.s. (for some $d > 0$), then

$$E\bigl[e^{X}\bigr] \le \exp\bigl(\varphi(d)\,\mathrm{Var}(X)\bigr) \tag{83}$$

where

$$\varphi(x) = \begin{cases} \dfrac{\exp(x)-1-x}{x^2} & \text{if } x \ne 0 \\[2mm] \dfrac{1}{2} & \text{if } x = 0. \end{cases}$$

From (56), it follows that $\varphi(x) = \frac{\varphi_2(x)}{2}$ for every $x \in \mathbb{R}$. Based on [50, Lemma 2.8], it follows that if $\{\xi_k, \mathcal{F}_k\}$ is a difference-martingale sequence (i.e., for every $k \in \mathbb{N}$,

$$E[\xi_k \mid \mathcal{F}_{k-1}] = 0$$

a.s.), and $\xi_k \le d$ a.s. for some $d > 0$, then for an arbitrary $t \ge 0$

$$E\bigl[\exp(t\xi_k) \mid \mathcal{F}_{k-1}\bigr] \le \exp\left(\frac{\gamma\,(td)^2\,\varphi_2(td)}{2}\right)$$

holds a.s. for every $k \in \mathbb{N}$ (the parameter $\gamma$ was introduced in (29)). The last inequality can be rewritten as

$$E\bigl[\exp(t\xi_k) \mid \mathcal{F}_{k-1}\bigr] \le \exp\bigl(\gamma\,(\exp(td)-1-td)\bigr), \qquad t \ge 0. \tag{84}$$

This forms a looser bound on the conditional expectation, as compared to (58) with $m = 2$, that gets the form

$$E\bigl[\exp(t\xi_k) \mid \mathcal{F}_{k-1}\bigr] \le 1 + \gamma\,(\exp(td)-1-td), \qquad t \ge 0. \tag{85}$$

The improvement in (85) over (84) follows since $e^x \ge 1 + x$ for $x \ge 0$ with equality if and only if $x = 0$. Note that the proof of [50, Lemma 2.8] shows that indeed the right-hand side of (85) forms an upper bound on the above conditional expectation, whereas it is loosened to the bound on the right-hand side of (84) in order to handle the case where

$$\frac{1}{n}\sum_{k=1}^{n} E\bigl[(\xi_k)^2 \mid \mathcal{F}_{k-1}\bigr] \le \sigma^2$$

and derive a closed-form solution of the optimized parameter $t$ in the resulting concentration inequality (see the proof of [50, Theorem 2.7] for the case of independent RVs, and also [50, Theorem 3.15] for the setting of martingales with bounded jumps). However, if for every $k \in \mathbb{N}$, the condition

$$E\bigl[(\xi_k)^2 \mid \mathcal{F}_{k-1}\bigr] \le \sigma^2$$

holds a.s., then the proof of Corollary 4 shows that a closed-form solution of the non-negative free parameter $t$ is obtained. More on the consequence of the difference between the bounds in (84) and (85) is considered in the next sub-section.
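The ordering between the bounds in (84) and (85) can be illustrated on a simple zero-mean RV. The sketch below assumes a two-point RV $\xi$ that equals $d$ with probability $\frac{\gamma}{1+\gamma}$ and $-\gamma d$ with probability $\frac{1}{1+\gamma}$, so that $\xi \le d$, $E[\xi] = 0$ and $\mathrm{Var}(\xi) = \gamma d^2$; this distribution and the name `mgf_and_bounds` are illustrative assumptions, not part of the chapter.

```python
import math

def mgf_and_bounds(gamma, d, t):
    """Return (E[exp(t*xi)], right side of (85), right side of (84)) for the two-point RV."""
    # xi = d with probability gamma/(1+gamma), xi = -gamma*d with probability 1/(1+gamma)
    mgf = (gamma * math.exp(t * d) + math.exp(-gamma * t * d)) / (1 + gamma)
    u = gamma * (math.exp(t * d) - 1 - t * d)
    return mgf, 1 + u, math.exp(u)

for gamma in (0.1, 0.5, 1.0):
    for t in (0.1, 1.0, 3.0):
        print(gamma, t, mgf_and_bounds(gamma, d=1.0, t=t))
```

In every printed triple the moment generating function is below the bound of (85), which is in turn below the bound of (84).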



E. Relation of the Concentration Inequalities for Martingales to Discrete-Time Markov Chains 

A striking well-known relation between discrete-time Markov chains and martingales is the following (see, e.g., [31, p. 473]): Let $\{X_n\}_{n\in\mathbb{N}_0}$ ($\mathbb{N}_0 = \mathbb{N}\cup\{0\}$) be a discrete-time Markov chain taking values in a countable state space $\mathcal{S}$ with transition matrix $P$, and let the function $\psi : \mathcal{S} \to \mathbb{R}$ be harmonic (i.e., $\sum_{j\in\mathcal{S}} p_{i,j}\,\psi(j) = \psi(i)$, $\forall\, i \in \mathcal{S}$), and assume that $E[|\psi(X_n)|] < \infty$ for every $n$. Then, $\{Y_n, \mathcal{F}_n\}_{n\in\mathbb{N}_0}$ is a martingale where $Y_n = \psi(X_n)$ and $\{\mathcal{F}_n\}_{n\in\mathbb{N}_0}$ is the natural filtration. This relation, which follows directly from the Markov property, makes it possible to apply the concentration inequalities in Section IV to harmonic functions of Markov chains when the function $\psi$ is bounded (so that the jumps of the martingale sequence are uniformly bounded).

Exponential deviation bounds for an important class of Markov chains, called Doeblin chains (they are characterized by an exponentially fast convergence to the equilibrium, uniformly in the initial condition) were derived in [39]. These bounds were also shown to be essentially identical to the Hoeffding inequality in the special case of i.i.d. RVs (see [39, Remark 1]).
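The harmonicity condition can be verified directly for a small chain. The following sketch assumes, purely for illustration, a symmetric random walk on $\{0,\ldots,5\}$ with absorbing end points and the candidate function $\psi(i) = i$; the helper names `transition_matrix` and `is_harmonic` are hypothetical.

```python
def transition_matrix(n_states):
    """Symmetric random walk on {0, ..., n_states-1} with absorbing end points."""
    P = [[0.0] * n_states for _ in range(n_states)]
    P[0][0] = 1.0
    P[-1][-1] = 1.0
    for i in range(1, n_states - 1):
        P[i][i - 1] = 0.5
        P[i][i + 1] = 0.5
    return P

def is_harmonic(P, psi, tol=1e-12):
    """Check sum_j p_ij * psi(j) == psi(i) for every state i."""
    n = len(psi)
    return all(abs(sum(P[i][j] * psi[j] for j in range(n)) - psi[i]) <= tol
               for i in range(n))

P = transition_matrix(6)
psi = [float(i) for i in range(6)]           # psi(i) = i is harmonic for this walk
psi_sq = [float(i * i) for i in range(6)]    # psi(i) = i^2 is not
print(is_harmonic(P, psi), is_harmonic(P, psi_sq))
```

Since $\psi$ is bounded here and changes by at most 1 per step, $Y_n = \psi(X_n)$ is a martingale with uniformly bounded jumps, which is the setting required by the inequalities of Section IV.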



F. Relations of [16, Theorem 2.23] with Corollary 4 and Proposition 4 

In the following, we consider the relation between the inequalities in Corollary 4 and Proposition 4 to the particularized form of [16, Theorem 2.23] (or also [15, Theorem 2.23]) in the setting where $d_k = d$ and $\sigma_k^2 = \sigma^2$ are fixed for every $k \in \mathbb{N}$. The resulting exponents of these concentration inequalities are also compared.

Let $\alpha \ge 0$ be an arbitrary non-negative number.

• In the analysis of small deviations, the bound in [16, Theorem 2.23] is particularized to

$$P\bigl(|X_n - X_0| \ge \alpha\sqrt{n}\bigr) \le 2\exp\left(-\frac{\alpha^2 n}{2n\sigma^2 + \frac{2d\alpha\sqrt{n}}{3}}\right).$$

From the notation in (29) then $\frac{\alpha^2}{\sigma^2} = \frac{\delta^2}{\gamma}$, and the last inequality gets the form

$$P\bigl(|X_n - X_0| \ge \alpha\sqrt{n}\bigr) \le 2\exp\left(-\frac{\delta^2}{2\gamma}\right)\left(1 + O\left(\frac{1}{\sqrt{n}}\right)\right).$$

It therefore follows that [16, Theorem 2.23] implies a concentration inequality of the form in (69). This shows that Proposition 4 can be also regarded as a consequence of [16, Theorem 2.23].

• In the analysis of large deviations, the bound in [16, Theorem 2.23] is particularized to

$$P\bigl(|X_n - X_0| \ge \alpha n\bigr) \le 2\exp\left(-\frac{\alpha^2 n}{2\sigma^2 + \frac{2d\alpha}{3}}\right).$$

From the notation in (29), this inequality is rewritten as

$$P\bigl(|X_n - X_0| \ge \alpha n\bigr) \le 2\exp\left(-\frac{\delta^2 n}{2\gamma + \frac{2\delta}{3}}\right). \tag{86}$$

It is claimed that the concentration inequality in (86) is looser than Corollary 4. This is a consequence of the proof of [16, Theorem 2.23] where the derived concentration inequality is loosened in order to handle the more general case, as compared to the setting in this chapter (see Theorem 3), where $d_k$ and $\sigma_k^2$ may depend on $k$. In order to show it explicitly, let us compare the steps of the derivation of the bound in Corollary 4 with the particularization of the derivation of [16, Theorem 2.23] in the special setting where $d_k$ and $\sigma_k^2$ are independent of $k$. This comparison is considered in the following. The derivation of the concentration inequality in Corollary 4 follows by substituting $m = 2$ in the proof of Theorem 4. It then follows that, for every $\alpha \ge 0$,

$$P(X_n - X_0 \ge \alpha n) \le e^{-n\delta x}\bigl(1 + \gamma\,(e^{x} - 1 - x)\bigr)^{n}, \qquad \forall\, x \ge 0 \tag{87}$$

which then leads, after an analytic optimization of the free non-negative parameter $x$ (see Lemma 6 and Appendix B), to the derivation of Corollary 4. On the other hand, the specialization of the proof of [16, Theorem 2.23] to the case where $d_k = d$ and $\sigma_k^2 = \sigma^2$ for every $k \in \mathbb{N}$ is equivalent to a further loosening of (87) to the bound

$$\begin{aligned}
P(X_n - X_0 \ge \alpha n) &\le e^{-n\delta x}\, e^{n\gamma\,(e^{x}-1-x)} && (88) \\
&\le e^{\,n\left(-\delta x + \frac{\gamma x^2}{2\left(1-\frac{x}{3}\right)}\right)}, \qquad \forall\, x \in (0,3) && (89)
\end{aligned}$$

and then choosing an optimal $x \in (0,3)$. This indeed shows that Corollary 4 provides a concentration inequality that is tighter than the bound in [16, Theorem 2.23].

In order to compare quantitatively the exponents of the concentration inequalities in [16, Theorem 2.23] and Corollary 4, let us revisit the derivation of the upper bounds on the probability of the events $\{|X_n - X_0| \ge \alpha n\}$ where $\alpha \ge 0$ is arbitrary. The optimized value of $x$ that is obtained in Appendix B is positive, and it becomes larger as we let the value of $\gamma \in (0, 1]$ approach zero. Hence, especially for small values of $\gamma$, the loosening of the bound from (87) to (89) is expected to deteriorate more significantly the resulting bound in [16, Theorem 2.23] due to the restriction that $x \in (0,3)$; this is in contrast to the optimized value of $x$ in Appendix B that may be above 3 for small values of $\gamma$, and it lies in general between 0 and $\frac{1}{\gamma}$. Note also that at $\delta = 1$, the exponent in Corollary 4 tends to infinity in the limit where $\gamma \to 0$, whereas the exponent in (86) tends in this case to $\frac{3}{2}$. To illustrate these differences, Figure 2 plots the exponents of the bounds in Corollary 4 and (86), where the latter refers to [16, Theorem 2.23], for $\gamma = 0.01$ and $0.99$. As is shown in Figure 2, the difference between the exponents of these two bounds is indeed more pronounced when $\gamma$ gets closer to zero.



Fig. 2. A comparison of the exponents of the bound in Corollary 4 and the particularized bound (86) from [16, Theorem 2.23]. This comparison is done for both $\gamma = 0.01$ and $0.99$. The solid curves refer to the exponents of the bound in Corollary 4, and the dashed curves refer to the exponents of the looser bound in (86). The upper pair of curves refers to the exponents for $\gamma = 0.01$, and the lower pair of curves (that approximately coincide) refers to the exponents for $\gamma = 0.99$.
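The comparison of the two exponents can be reproduced numerically. The sketch below is not the closed-form expression of Corollary 4; it assumes that the exponent of Corollary 4 equals the supremum over $x \ge 0$ of $\delta x - \ln\bigl(1+\gamma(e^x-1-x)\bigr)$, as implied by (87), and finds it by a ternary search on $[0, 1/\gamma]$, where the objective is concave for $\delta \in (0,1]$. The function names are illustrative.

```python
import math

def exponent_corollary4(delta, gamma, iters=200):
    """Numerical sup over x >= 0 of delta*x - ln(1 + gamma*(e^x - 1 - x)), cf. (87).

    Assumes 0 < delta <= 1 and 0 < gamma <= 1, so the maximizer lies in [0, 1/gamma].
    """
    def objective(x):
        return delta * x - math.log1p(gamma * (math.expm1(x) - x))
    lo, hi = 0.0, 1.0 / gamma
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1) < objective(m2):
            lo = m1
        else:
            hi = m2
    return objective(0.5 * (lo + hi))

def exponent_86(delta, gamma):
    """Exponent of the looser bound (86)."""
    return delta ** 2 / (2.0 * gamma + 2.0 * delta / 3.0)

for gamma in (0.01, 0.99):
    for delta in (0.1, 0.5, 0.9):
        print(gamma, delta, exponent_corollary4(delta, gamma), exponent_86(delta, gamma))
```

The printed values show the same qualitative picture as the figure: the two exponents nearly coincide for $\gamma = 0.99$, while for $\gamma = 0.01$ the exponent obtained from (87) is considerably larger.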

Consider, on the other hand, the probability of an event $\{|X_n - X_0| \ge \alpha\sqrt{n}\}$ where $\alpha \ge 0$ is arbitrary. It is shown in Appendix C that the optimized value of $x$ for the bound in Corollary 4 (and its generalized version in Theorem 4) scales like $\frac{1}{\sqrt{n}}$. Hence, it is approximately zero for $n \gg 1$, and $u \triangleq \gamma(e^{x} - 1 - x) \approx \frac{\gamma x^2}{2}$ scales like $\frac{1}{n}$. It therefore follows that $(1+u)^n \approx e^{nu}$ for $n \gg 1$. Moreover, the restriction on $x$ to be less than 3 in (89) does not affect the tightness of the bound in this case since the optimized value of $x$ is anyway close to zero. This explains the observation that the two bounds in Proposition 4 and [16, Theorem 2.23] essentially scale similarly for small deviations of order $\sqrt{n}$.

VI. Applications in Information Theory and Related Topics 

The refined versions of Azuma's inequality in Section IV are applied in this section to information-theoretic 
aspects. The results in this section have been presented in part in [65], [66], [67] and [82]. 

A. Binary Hypothesis Testing 

Binary hypothesis testing for finite alphabet models was analyzed via the method of types, e.g., in [18, Chapter 11] and [19]. It is assumed that the data sequence is of a fixed length ($n$), and one wishes to make the optimal decision based on the received sequence and the Neyman-Pearson ratio test.

Let the RVs $X_1, X_2, \ldots$ be i.i.d. $\sim Q$, and consider two hypotheses:

• $H_1 : Q = P_1$.

• $H_2 : Q = P_2$.

For the simplicity of the analysis, let us assume that the RVs are discrete, and take their values on a finite alphabet $\mathcal{X}$ where $P_1(x), P_2(x) > 0$ for every $x \in \mathcal{X}$.
In the following, let

$$L(X_1,\ldots,X_n) \triangleq \sum_{i=1}^{n} \ln\frac{P_1(X_i)}{P_2(X_i)}$$

designate the log-likelihood ratio. By the strong law of large numbers (SLLN), if hypothesis $H_1$ is true, then a.s.

$$\lim_{n\to\infty}\frac{L(X_1,\ldots,X_n)}{n} = D(P_1\|P_2) \tag{90}$$

and otherwise, if hypothesis $H_2$ is true, then a.s.

$$\lim_{n\to\infty}\frac{L(X_1,\ldots,X_n)}{n} = -D(P_2\|P_1) \tag{91}$$

where the above assumptions on the probability mass functions $P_1$ and $P_2$ imply that the relative entropies, $D(P_1\|P_2)$ and $D(P_2\|P_1)$, are both finite. Consider the case where for some fixed constants $\overline{\lambda}, \underline{\lambda} \in \mathbb{R}$ that satisfy

$$-D(P_2\|P_1) < \underline{\lambda} \le \overline{\lambda} < D(P_1\|P_2)$$

one decides on hypothesis $H_1$ if

$$L(X_1,\ldots,X_n) > n\overline{\lambda}$$

and on hypothesis $H_2$ if

$$L(X_1,\ldots,X_n) < n\underline{\lambda}.$$

Note that if $\overline{\lambda} = \underline{\lambda} \triangleq \lambda$ then a decision on the two hypotheses is based on comparing the normalized log-likelihood ratio (w.r.t. $n$) to a single threshold ($\lambda$), and deciding on hypothesis $H_1$ or $H_2$ if it is, respectively, above or below $\lambda$. If $\underline{\lambda} < \overline{\lambda}$ then one decides on $H_1$ or $H_2$ if the normalized log-likelihood ratio is, respectively, above the upper threshold $\overline{\lambda}$ or below the lower threshold $\underline{\lambda}$. Otherwise, if the normalized log-likelihood ratio is between the upper and lower thresholds, then an erasure is declared and no decision is taken in this case.
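Let

$$\alpha_n^{(1)} \triangleq P_1^n\bigl(L(X_1,\ldots,X_n) \le n\overline{\lambda}\bigr) \tag{92}$$

$$\alpha_n^{(2)} \triangleq P_1^n\bigl(L(X_1,\ldots,X_n) \le n\underline{\lambda}\bigr) \tag{93}$$

and

$$\beta_n^{(1)} \triangleq P_2^n\bigl(L(X_1,\ldots,X_n) \ge n\underline{\lambda}\bigr) \tag{94}$$

$$\beta_n^{(2)} \triangleq P_2^n\bigl(L(X_1,\ldots,X_n) \ge n\overline{\lambda}\bigr) \tag{95}$$

then $\alpha_n^{(1)}$ and $\beta_n^{(1)}$ are the probabilities of either making an error or declaring an erasure under, respectively, hypotheses $H_1$ and $H_2$; similarly, $\alpha_n^{(2)}$ and $\beta_n^{(2)}$ are the probabilities of making an error under hypotheses $H_1$ and $H_2$, respectively.

Let $\pi_1, \pi_2 \in (0, 1)$ denote the a-priori probabilities of the hypotheses $H_1$ and $H_2$, respectively, so

$$P_{e,n}^{(1)} = \pi_1\alpha_n^{(1)} + \pi_2\beta_n^{(1)} \tag{96}$$

is the probability of having either an error or an erasure, and

$$P_{e,n}^{(2)} = \pi_1\alpha_n^{(2)} + \pi_2\beta_n^{(2)} \tag{97}$$

is the probability of error.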
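For a binary alphabet, the four probabilities in (92)-(95) can be computed exactly from the binomial distribution. The following is a minimal sketch under that assumption; the function name `llr_probabilities`, its argument names and the numerical values are hypothetical and serve only as an illustration of the decision rule with erasures.

```python
import math

def llr_probabilities(p1, p2, n, lam_hi, lam_lo):
    """Exact (alpha1, alpha2, beta1, beta2) of (92)-(95) for the alphabet {0, 1}.

    p1, p2 are P_1(1) and P_2(1); lam_hi >= lam_lo are the upper and lower thresholds.
    """
    llr_one = math.log(p1 / p2)               # contribution of a symbol 1 to L
    llr_zero = math.log((1 - p1) / (1 - p2))  # contribution of a symbol 0 to L
    alpha1 = alpha2 = beta1 = beta2 = 0.0
    for k in range(n + 1):                    # k = number of ones in the sequence
        L = k * llr_one + (n - k) * llr_zero
        w1 = math.comb(n, k) * p1 ** k * (1 - p1) ** (n - k)   # probability under H1
        w2 = math.comb(n, k) * p2 ** k * (1 - p2) ** (n - k)   # probability under H2
        if L <= n * lam_hi:
            alpha1 += w1   # error or erasure under H1
        if L <= n * lam_lo:
            alpha2 += w1   # error under H1
        if L >= n * lam_lo:
            beta1 += w2    # error or erasure under H2
        if L >= n * lam_hi:
            beta2 += w2    # error under H2
    return alpha1, alpha2, beta1, beta2

a1, a2, b1, b2 = llr_probabilities(p1=0.6, p2=0.4, n=50, lam_hi=0.02, lam_lo=-0.02)
print(a1, a2, b1, b2)
```

With a single threshold (`lam_hi == lam_lo`) the erasure region is empty, and the two probabilities under each hypothesis coincide.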

1) Exact Exponents: When we let $n$ tend to infinity, the exact exponents of $\alpha_n^{(j)}$ and $\beta_n^{(j)}$ ($j = 1, 2$) are derived via Cramér's theorem. The resulting exponents form a straightforward generalization of, e.g., [21, Theorem 3.4.3] and [35, Theorem 6.4] that addresses the case where the decision is made based on a single threshold of the log-likelihood ratio. In this particular case where $\overline{\lambda} = \underline{\lambda} \triangleq \lambda$, the option of erasures does not exist, and $P_{e,n}^{(1)} = P_{e,n}^{(2)} \triangleq P_{e,n}$ is the error probability.

In the considered general case with erasures, let

$$\lambda_1 \triangleq -\overline{\lambda}, \qquad \lambda_2 \triangleq -\underline{\lambda}$$

then Cramér's theorem on $\mathbb{R}$ yields that the exact exponents of $\alpha_n^{(1)}$, $\alpha_n^{(2)}$, $\beta_n^{(1)}$ and $\beta_n^{(2)}$ are given by

$$\lim_{n\to\infty} -\frac{\ln\alpha_n^{(1)}}{n} = I(\lambda_1) \tag{98}$$

$$\lim_{n\to\infty} -\frac{\ln\alpha_n^{(2)}}{n} = I(\lambda_2) \tag{99}$$

$$\lim_{n\to\infty} -\frac{\ln\beta_n^{(1)}}{n} = I(\lambda_2) - \lambda_2 \tag{100}$$

$$\lim_{n\to\infty} -\frac{\ln\beta_n^{(2)}}{n} = I(\lambda_1) - \lambda_1 \tag{101}$$

where the rate function $I$ is given by

$$I(r) \triangleq \sup_{t\in\mathbb{R}}\bigl(tr - H(t)\bigr) \tag{102}$$

and

$$H(t) = \ln\left(\sum_{x\in\mathcal{X}} P_1(x)^{1-t}\,P_2(x)^{t}\right), \qquad \forall\, t \in \mathbb{R}. \tag{103}$$

The rate function $I$ is convex, lower semi-continuous (l.s.c.) and non-negative (see, e.g., [21] and [35]). Note that

$$H(t) = (t-1)\,D_t(P_2\|P_1)$$

where $D_t(P\|Q)$ designates Rényi's information divergence of order $t$ [59, Eq. (3.3)], and $I$ in (102) is the Fenchel-Legendre transform of $H$ (see, e.g., [21, Definition 2.2.2]).

From (96)-(101), the exact exponents of $P_{e,n}^{(1)}$ and $P_{e,n}^{(2)}$ are equal to

$$\lim_{n\to\infty} -\frac{\ln P_{e,n}^{(1)}}{n} = \min\bigl\{I(\lambda_1),\ I(\lambda_2) - \lambda_2\bigr\} \tag{104}$$

and

$$\lim_{n\to\infty} -\frac{\ln P_{e,n}^{(2)}}{n} = \min\bigl\{I(\lambda_2),\ I(\lambda_1) - \lambda_1\bigr\}. \tag{105}$$

For the case where the decision is based on a single threshold for the log-likelihood ratio (i.e., $\lambda_1 = \lambda_2 \triangleq \lambda$), then $P_{e,n}^{(1)} = P_{e,n}^{(2)} \triangleq P_{e,n}$, and its error exponent is equal to

$$\lim_{n\to\infty} -\frac{\ln P_{e,n}}{n} = \min\bigl\{I(\lambda),\ I(\lambda) - \lambda\bigr\} \tag{106}$$

which coincides with the error exponent in [21, Theorem 3.4.3] (or [35, Theorem 6.4]). The optimal threshold for obtaining the best error exponent of the error probability $P_{e,n}$ is equal to zero (i.e., $\lambda = 0$); in this case, the exact error exponent is equal to

$$I(0) = -\min_{0\le t\le 1}\ln\left(\sum_{x\in\mathcal{X}} P_1(x)^{1-t}\,P_2(x)^{t}\right) \triangleq C(P_1, P_2) \tag{107}$$

which is the Chernoff information of the probability measures $P_1$ and $P_2$ (see [18, Eq. (11.239)]), and it is symmetric (i.e., $C(P_1,P_2) = C(P_2,P_1)$). Note that, from (102), $I(0) = \sup_{t\in\mathbb{R}}\bigl(-H(t)\bigr) = -\inf_{t\in\mathbb{R}}\bigl(H(t)\bigr)$; the minimization in (107) over the interval $[0, 1]$ (instead of taking the infimum of $H$ over $\mathbb{R}$) is due to the fact that $H(0) = H(1) = 0$ and the function $H$ in (103) is convex, so it is enough to restrict the infimum of $H$ to the closed interval $[0, 1]$ for which it turns to be a minimum.
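Since $H$ is convex on $[0,1]$, the minimization in (107) is easy to carry out numerically. The following sketch uses a ternary search; the function names `log_mgf` and `chernoff_information` and the pair of distributions are illustrative assumptions.

```python
import math

def log_mgf(p1, p2, t):
    """H(t) = ln sum_x P1(x)^(1-t) * P2(x)^t, as in (103)."""
    return math.log(sum(a ** (1 - t) * b ** t for a, b in zip(p1, p2)))

def chernoff_information(p1, p2, iters=200):
    """C(P1, P2) = -min over t in [0, 1] of H(t), as in (107), via ternary search."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if log_mgf(p1, p2, m1) > log_mgf(p1, p2, m2):
            lo = m1
        else:
            hi = m2
    return -log_mgf(p1, p2, 0.5 * (lo + hi))

# Probability vectors over a binary alphabet (illustrative values).
P1 = [0.4, 0.6]
P2 = [0.6, 0.4]
print(chernoff_information(P1, P2), chernoff_information(P2, P1))
```

The two printed values agree, in line with the symmetry of the Chernoff information, and $H$ vanishes at both end points of the interval.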

Paper [12] considers binary hypothesis testing from an information-theoretic point of view, and it derives the 
error exponents of binary hypothesis testers in analogy to optimum channel codes via the use of relative entropy 
measures. We will further explore this kind of analogy later in this section (see Sections VI-A5
and VI-A6 w.r.t. moderate and small deviations analysis of binary hypothesis testing).

2) Lower Bound on the Exponents via Theorem 3: In the following, the tightness of Theorem 3 is examined by
using it for the derivation of lower bounds on the error exponent and the exponent of the event of having either an 
error or an erasure. These results will be compared in the next sub-section to the exact exponents from the previous 
sub-section. 

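We first derive a lower bound on the exponent of $\alpha_n^{(1)}$. Under hypothesis $H_1$, let us construct the martingale sequence $\{U_k, \mathcal{F}_k\}_{k=0}^{n}$ where $\mathcal{F}_0 \subseteq \mathcal{F}_1 \subseteq \ldots \subseteq \mathcal{F}_n$ is the filtration

$$\mathcal{F}_0 = \{\emptyset, \Omega\}, \qquad \mathcal{F}_k = \sigma(X_1,\ldots,X_k), \quad \forall\, k \in \{1,\ldots,n\}$$

and

$$U_k = E_{P_1^n}\bigl[L(X_1,\ldots,X_n) \mid \mathcal{F}_k\bigr]. \tag{108}$$

For every $k \in \{0,\ldots,n\}$

$$\begin{aligned}
U_k &= E_{P_1^n}\left[\sum_{i=1}^{n}\ln\frac{P_1(X_i)}{P_2(X_i)} \,\Big|\, \mathcal{F}_k\right] \\
&= \sum_{i=1}^{k}\ln\frac{P_1(X_i)}{P_2(X_i)} + \sum_{i=k+1}^{n} E_{P_1^n}\left[\ln\frac{P_1(X_i)}{P_2(X_i)}\right] \\
&= \sum_{i=1}^{k}\ln\frac{P_1(X_i)}{P_2(X_i)} + (n-k)\,D(P_1\|P_2).
\end{aligned}$$

In particular

$$U_0 = n\,D(P_1\|P_2), \tag{109}$$

$$U_n = \sum_{i=1}^{n}\ln\frac{P_1(X_i)}{P_2(X_i)} = L(X_1,\ldots,X_n) \tag{110}$$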






DRAFT. LAST UPDATED: OCTOBER 28, 2012 



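and, for every $k \in \{1,\ldots,n\}$,

$$U_k - U_{k-1} = \ln\frac{P_1(X_k)}{P_2(X_k)} - D(P_1\|P_2). \tag{111}$$

Let

$$d_1 \triangleq \max_{x\in\mathcal{X}}\left|\ln\frac{P_1(x)}{P_2(x)} - D(P_1\|P_2)\right| \tag{112}$$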



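so $d_1 < \infty$ since by assumption the alphabet set $\mathcal{X}$ is finite, and $P_1(x), P_2(x) > 0$ for every $x \in \mathcal{X}$. From (111) and (112)

$$|U_k - U_{k-1}| \le d_1$$

holds a.s. for every $k \in \{1,\ldots,n\}$, and due to the statistical independence of the RVs in the sequence $\{X_i\}$

$$\begin{aligned}
E_{P_1^n}\bigl[(U_k - U_{k-1})^2 \mid \mathcal{F}_{k-1}\bigr]
&= E_{P_1}\left[\left(\ln\frac{P_1(X_k)}{P_2(X_k)} - D(P_1\|P_2)\right)^{2}\right] \\
&= \sum_{x\in\mathcal{X}} P_1(x)\left(\ln\frac{P_1(x)}{P_2(x)} - D(P_1\|P_2)\right)^{2} \\
&\triangleq \sigma_1^2.
\end{aligned} \tag{113}$$

Let

$$\varepsilon_{1,1} \triangleq D(P_1\|P_2) - \overline{\lambda}, \qquad \varepsilon_{2,1} \triangleq D(P_2\|P_1) + \underline{\lambda} \tag{114}$$

$$\varepsilon_{1,2} \triangleq D(P_1\|P_2) - \underline{\lambda}, \qquad \varepsilon_{2,2} \triangleq D(P_2\|P_1) + \overline{\lambda}. \tag{115}$$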


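The probability of making an erroneous decision on hypothesis $H_2$ or declaring an erasure under the hypothesis $H_1$ is equal to $\alpha_n^{(1)}$, and from Theorem 3

$$\begin{aligned}
\alpha_n^{(1)} &\triangleq P_1^n\bigl(L(X_1,\ldots,X_n) \le n\overline{\lambda}\bigr) \\
&\overset{(a)}{=} P_1^n\bigl(U_n - U_0 \le -\varepsilon_{1,1}\,n\bigr) && (116) \\
&\overset{(b)}{\le} \exp\left(-n\,D\left(\frac{\delta_{1,1}+\gamma_1}{1+\gamma_1}\,\Big\|\,\frac{\gamma_1}{1+\gamma_1}\right)\right) && (117)
\end{aligned}$$

where equality (a) follows from (109), (110) and (114), and inequality (b) follows from Theorem 3 with

$$\gamma_1 \triangleq \frac{\sigma_1^2}{d_1^2}, \qquad \delta_{1,1} \triangleq \frac{\varepsilon_{1,1}}{d_1}. \tag{118}$$

Note that if $\varepsilon_{1,1} > d_1$ then it follows from (111) and (112) that $\alpha_n^{(1)}$ is zero; in this case $\delta_{1,1} > 1$, so the divergence in (117) is infinity and the upper bound is also equal to zero. Hence, it is assumed without loss of generality that $\delta_{1,1} \in [0, 1]$.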

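Similarly to (108), under hypothesis $H_2$, let us define the martingale sequence $\{U_k, \mathcal{F}_k\}_{k=0}^{n}$ with the same filtration and

$$U_k = E_{P_2^n}\bigl[L(X_1,\ldots,X_n) \mid \mathcal{F}_k\bigr], \qquad \forall\, k \in \{0,\ldots,n\}. \tag{119}$$

For every $k \in \{0,\ldots,n\}$

$$U_k = \sum_{i=1}^{k}\ln\frac{P_1(X_i)}{P_2(X_i)} - (n-k)\,D(P_2\|P_1)$$

and in particular

$$U_0 = -n\,D(P_2\|P_1), \qquad U_n = L(X_1,\ldots,X_n). \tag{120}$$

For every $k \in \{1,\ldots,n\}$,

$$U_k - U_{k-1} = \ln\frac{P_1(X_k)}{P_2(X_k)} + D(P_2\|P_1). \tag{121}$$

Let

$$d_2 \triangleq \max_{x\in\mathcal{X}}\left|\ln\frac{P_2(x)}{P_1(x)} - D(P_2\|P_1)\right| \tag{122}$$

then, the jumps of the latter martingale sequence are uniformly bounded by $d_2$ and, similarly to (113), for every $k \in \{1,\ldots,n\}$

$$E_{P_2^n}\bigl[(U_k - U_{k-1})^2 \mid \mathcal{F}_{k-1}\bigr] = \sum_{x\in\mathcal{X}} P_2(x)\left(\ln\frac{P_2(x)}{P_1(x)} - D(P_2\|P_1)\right)^{2} \triangleq \sigma_2^2. \tag{123}$$

Hence, it follows from Theorem 3 that

$$\begin{aligned}
\beta_n^{(1)} &\triangleq P_2^n\bigl(L(X_1,\ldots,X_n) \ge n\underline{\lambda}\bigr) \\
&= P_2^n\bigl(U_n - U_0 \ge \varepsilon_{2,1}\,n\bigr) && (124) \\
&\le \exp\left(-n\,D\left(\frac{\delta_{2,1}+\gamma_2}{1+\gamma_2}\,\Big\|\,\frac{\gamma_2}{1+\gamma_2}\right)\right) && (125)
\end{aligned}$$

where the equality in (124) holds due to (120) and (114), and (125) follows from Theorem 3 with

$$\gamma_2 \triangleq \frac{\sigma_2^2}{d_2^2}, \qquad \delta_{2,1} \triangleq \frac{\varepsilon_{2,1}}{d_2} \tag{126}$$

and the constants $d_2$ and $\sigma_2$ are introduced, respectively, in (122) and (123).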

From (96), (117) and (125), the exponent of the probability of either having an error or an erasure is lower bounded by

$$\lim_{n\to\infty} -\frac{\ln P_{e,n}^{(1)}}{n} \ge \min_{i=1,2} D\left(\frac{\delta_{i,1}+\gamma_i}{1+\gamma_i}\,\Big\|\,\frac{\gamma_i}{1+\gamma_i}\right). \tag{127}$$

Similarly to the above analysis, one gets from (97) and (115) that the error exponent is lower bounded by

$$\lim_{n\to\infty} -\frac{\ln P_{e,n}^{(2)}}{n} \ge \min_{i=1,2} D\left(\frac{\delta_{i,2}+\gamma_i}{1+\gamma_i}\,\Big\|\,\frac{\gamma_i}{1+\gamma_i}\right) \tag{128}$$

where

$$\delta_{1,2} \triangleq \frac{\varepsilon_{1,2}}{d_1}, \qquad \delta_{2,2} \triangleq \frac{\varepsilon_{2,2}}{d_2}. \tag{129}$$

For the case of a single threshold (i.e., $\overline{\lambda} = \underline{\lambda} \triangleq \lambda$) then (127) and (128) coincide, and one obtains that the error exponent satisfies

$$\lim_{n\to\infty} -\frac{\ln P_{e,n}}{n} \ge \min_{i=1,2} D\left(\frac{\delta_i+\gamma_i}{1+\gamma_i}\,\Big\|\,\frac{\gamma_i}{1+\gamma_i}\right) \tag{130}$$

where $\delta_i$ is the common value of $\delta_{i,1}$ and $\delta_{i,2}$ (for $i = 1, 2$). In this special case, the zero threshold is optimal (see, e.g., [21, p. 93]), which then yields that (130) is satisfied with

$$\delta_1 = \frac{D(P_1\|P_2)}{d_1}, \qquad \delta_2 = \frac{D(P_2\|P_1)}{d_2} \tag{131}$$

with $d_1$ and $d_2$ from (112) and (122), respectively. The right-hand side of (130) forms a lower bound on Chernoff information which is the exact error exponent for this special case.
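The quantities entering the lower bound in (130) are simple functions of the two probability mass functions. The sketch below computes $d_i$, $\sigma_i^2$, $\gamma_i$ and $\delta_i$ from (112), (113), (122), (123) and (131) for the zero threshold, together with the resulting exponent; for reference it also returns $\min_i \delta_i^2/2$, the exponent that Azuma's inequality would give with the same $\delta_i$. The function names and the ternary-alphabet example are assumptions for illustration only.

```python
import math

def kl(p, q):
    """Relative entropy D(p||q) in nats for strictly positive pmfs."""
    return sum(a * math.log(a / b) for a, b in zip(p, q))

def theorem3_exponent(delta, gamma):
    """Binary divergence D((delta+gamma)/(1+gamma) || gamma/(1+gamma)), for 0 <= delta < 1."""
    p = (delta + gamma) / (1 + gamma)
    q = gamma / (1 + gamma)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def one_sided_parameters(p, q):
    """(delta, gamma) when the data follow p and the alternative is q (zero threshold)."""
    div = kl(p, q)
    centered = [math.log(a / b) - div for a, b in zip(p, q)]
    d = max(abs(c) for c in centered)                     # as in (112) / (122)
    var = sum(a * c * c for a, c in zip(p, centered))     # as in (113) / (123)
    return div / d, var / d ** 2                          # delta from (131), gamma from (118)

def exponent_lower_bounds(p1, p2):
    """Return (refined lower bound of (130), Azuma-type exponent min_i delta_i^2 / 2)."""
    params = [one_sided_parameters(p1, p2), one_sided_parameters(p2, p1)]
    refined = min(theorem3_exponent(dl, g) for dl, g in params)
    azuma = min(dl ** 2 / 2 for dl, _ in params)
    return refined, azuma

# Hypothetical pair of pmfs over a ternary alphabet.
refined, azuma = exponent_lower_bounds([0.5, 0.3, 0.2], [0.2, 0.3, 0.5])
print(refined, azuma)
```

For this pair the Chernoff information is about 0.0699, so the refined value is a valid lower bound on the exact exponent, and it is noticeably larger than the Azuma-type value.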
3) Comparison of the Lower Bounds on the Exponents with those that Follow from Azuma's Inequality: The lower bounds on the error exponent and the exponent of the probability of having either errors or erasures, that were derived in the previous sub-section via Theorem 3, are compared in the following to the loosened lower bounds on these exponents that follow from Azuma's inequality.

We first obtain upper bounds on $\alpha_n^{(1)}$, $\alpha_n^{(2)}$, $\beta_n^{(1)}$ and $\beta_n^{(2)}$ via Azuma's inequality, and then use them to derive lower bounds on the exponents of $P_{e,n}^{(1)}$ and $P_{e,n}^{(2)}$.

From (111), (112), (116), (118), and Azuma's inequality

$$\alpha_n^{(1)} \le \exp\left(-\frac{\delta_{1,1}^2\, n}{2}\right) \tag{132}$$

and, similarly, from (121), (122), (124), (126), and Azuma's inequality

$$\beta_n^{(1)} \le \exp\left(-\frac{\delta_{2,1}^2\, n}{2}\right). \tag{133}$$

From (93), (95), (115), (129) and Azuma's inequality

$$\alpha_n^{(2)} \le \exp\left(-\frac{\delta_{1,2}^2\, n}{2}\right) \tag{134}$$

$$\beta_n^{(2)} \le \exp\left(-\frac{\delta_{2,2}^2\, n}{2}\right). \tag{135}$$

Therefore, it follows from (96), (97) and (132)-(135) that the resulting lower bounds on the exponents of $P_{e,n}^{(1)}$ and $P_{e,n}^{(2)}$ are

$$\lim_{n\to\infty} -\frac{\ln P_{e,n}^{(j)}}{n} \ge \min_{i=1,2}\frac{\delta_{i,j}^2}{2}, \qquad j = 1, 2 \tag{136}$$

as compared to (127) and (128) which give, for $j = 1, 2$,

$$\lim_{n\to\infty} -\frac{\ln P_{e,n}^{(j)}}{n} \ge \min_{i=1,2} D\left(\frac{\delta_{i,j}+\gamma_i}{1+\gamma_i}\,\Big\|\,\frac{\gamma_i}{1+\gamma_i}\right). \tag{137}$$

For the specific case of a zero threshold, the lower bound on the error exponent which follows from Azuma's inequality is given by

$$\lim_{n\to\infty} -\frac{\ln P_{e,n}}{n} \ge \min_{i=1,2}\frac{\delta_i^2}{2} \tag{138}$$

with the values of $\delta_1$ and $\delta_2$ in (131).

The lower bounds on the exponents in (136) and (137) are compared in the following. Note that the lower bounds 
in (136) are loosened as compared to those in (137) since they follow, respectively, from Azuma's inequality and 
its improvement in Theorem 3. 

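The divergence in the exponent of (137) is equal to

$$D\left(\frac{\delta_{i,j}+\gamma_i}{1+\gamma_i}\,\Big\|\,\frac{\gamma_i}{1+\gamma_i}\right) = \frac{\gamma_i}{1+\gamma_i}\left[\left(1+\frac{\delta_{i,j}}{\gamma_i}\right)\ln\left(1+\frac{\delta_{i,j}}{\gamma_i}\right) + \frac{(1-\delta_{i,j})\ln(1-\delta_{i,j})}{\gamma_i}\right]. \tag{139}$$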

Lemma 4:

$$(1+u)\ln(1+u) \ge \begin{cases} u + \dfrac{u^2}{2}, & u \in [-1, 0] \\[2mm] u + \dfrac{u^2}{2} - \dfrac{u^3}{6}, & u \ge 0 \end{cases} \tag{140}$$

where at $u = -1$, the left-hand side is defined to be zero (it is the limit of this function when $u \to -1$ from above).

Proof: The proof follows by elementary calculus. ■

Since $\delta_{i,j} \in [0, 1]$, then (139) and Lemma 4 imply that

$$D\left(\frac{\delta_{i,j}+\gamma_i}{1+\gamma_i}\,\Big\|\,\frac{\gamma_i}{1+\gamma_i}\right) \ge \frac{\delta_{i,j}^2}{2\gamma_i} - \frac{\delta_{i,j}^3}{6\gamma_i^2\,(1+\gamma_i)}. \tag{141}$$

Hence, by comparing (136) with the combination of (137) and (141), then it follows that (up to a second-order approximation) the lower bounds on the exponents that were derived via Theorem 3 are improved by at least a factor of $(\max_i \gamma_i)^{-1}$ as compared to those that follow from Azuma's inequality.
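The chain of inequalities behind this comparison can be checked numerically for given values of $\delta$ and $\gamma$. The sketch below evaluates the divergence in (139), its lower bound in (141), and the Azuma exponent $\delta^2/2$ of (136); the function names and the grid of values are illustrative assumptions.

```python
import math

def theorem3_exponent(delta, gamma):
    """D((delta+gamma)/(1+gamma) || gamma/(1+gamma)) in nats, for 0 <= delta < 1."""
    p = (delta + gamma) / (1 + gamma)
    q = gamma / (1 + gamma)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def second_order_bound(delta, gamma):
    """Right-hand side of (141)."""
    return delta ** 2 / (2 * gamma) - delta ** 3 / (6 * gamma ** 2 * (1 + gamma))

def azuma_exponent(delta):
    """Exponent delta^2 / 2 that follows from Azuma's inequality."""
    return delta ** 2 / 2

for delta in (0.05, 1.0 / 6.0, 0.5):
    for gamma in (0.2, 7.0 / 9.0, 1.0):
        print(delta, gamma, theorem3_exponent(delta, gamma),
              second_order_bound(delta, gamma), azuma_exponent(delta))
```

For small $\delta$ the ratio between the first and the last printed columns is close to $1/\gamma$, which is the improvement factor mentioned above.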
Example 4: Consider two probability measures $P_1$ and $P_2$ where

$$P_1(0) = P_2(1) = 0.4, \qquad P_1(1) = P_2(0) = 0.6,$$

and the case of a single threshold of the log-likelihood ratio that is set to zero (i.e., $\lambda = 0$). The exact error exponent in this case is Chernoff information that is equal to

$$C(P_1, P_2) = 2.04 \cdot 10^{-2}.$$

The improved lower bound on the error exponent in (130) and (131) is equal to $1.77 \cdot 10^{-2}$, whereas the loosened lower bound in (138) is equal to $1.39 \cdot 10^{-2}$. In this case $\gamma_1 = \frac{7}{9}$ and $\gamma_2 = \frac{7}{9}$, so the improvement in the lower bound on the error exponent is indeed by a factor of approximately

$$\left(\max_i \gamma_i\right)^{-1} = \frac{9}{7}.$$

Note that, from (117), (125) and (132)-(135), these are lower bounds on the error exponents for any finite block length $n$, and not only asymptotically in the limit where $n \to \infty$. The operational meaning of this example is that the improved lower bound on the error exponent assures that a fixed error probability can be obtained based on a sequence of i.i.d. RVs whose length is reduced by 22.2% as compared to the loosened bound which follows from Azuma's inequality.

4) Comparison of the Exact and Lower Bounds on the Error Exponents, Followed by a Relation to Fisher 
Information: In the following, we compare the exact and lower bounds on the error exponents. Consider the case 
where there is a single threshold on the log-likelihood ratio (i.e., referring to the case where the erasure option is 
not provided) that is set to zero. The exact error exponent in this case is given by the Chernoff information (see 
(107)), and it will be compared to the two lower bounds on the error exponents that were derived in the previous 
two subsections. 

Let $\{P_\theta\}_{\theta\in\Theta}$ denote an indexed family of probability mass functions where $\Theta$ denotes the parameter set. Assume that $P_\theta$ is differentiable in the parameter $\theta$. Then, the Fisher information is defined as

$$J(\theta) \triangleq E_\theta\left[\left(\frac{\partial}{\partial\theta}\ln P_\theta(x)\right)^{2}\right] \tag{142}$$

where the expectation is w.r.t. the probability mass function $P_\theta$. The divergence and Fisher information are two related information measures, satisfying the equality

$$\lim_{\theta'\to\theta}\frac{D(P_\theta\|P_{\theta'})}{(\theta-\theta')^2} = \frac{J(\theta)}{2} \tag{143}$$

(note that if it was a relative entropy to base 2 then the right-hand side of (143) would have been divided by $\ln 2$, and be equal to $\frac{J(\theta)}{\ln 4}$ as in [18, Eq. (12.364)]).
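Proposition 5: Under the above assumptions,

• The Chernoff information and Fisher information are related information measures that satisfy the equality

$$\lim_{\theta'\to\theta}\frac{C(P_\theta, P_{\theta'})}{(\theta-\theta')^2} = \frac{J(\theta)}{8}. \tag{144}$$

• Let

$$E_L(P_\theta, P_{\theta'}) \triangleq \min_{i=1,2} D\left(\frac{\delta_i+\gamma_i}{1+\gamma_i}\,\Big\|\,\frac{\gamma_i}{1+\gamma_i}\right) \tag{145}$$

be the lower bound on the error exponent in (130) which corresponds to $P_1 \triangleq P_\theta$ and $P_2 \triangleq P_{\theta'}$, then also

$$\lim_{\theta'\to\theta}\frac{E_L(P_\theta, P_{\theta'})}{(\theta-\theta')^2} = \frac{J(\theta)}{8}. \tag{146}$$

• Let

$$\widetilde{E}_L(P_\theta, P_{\theta'}) \triangleq \min_{i=1,2}\frac{\delta_i^2}{2} \tag{147}$$

be the loosened lower bound on the error exponent in (138) which refers to $P_1 \triangleq P_\theta$ and $P_2 \triangleq P_{\theta'}$. Then,

$$\lim_{\theta'\to\theta}\frac{\widetilde{E}_L(P_\theta, P_{\theta'})}{(\theta-\theta')^2} = \frac{a(\theta)\,J(\theta)}{8}$$

for some deterministic function $a$ bounded in $[0,1]$, and there exists an indexed family of probability mass functions for which $a(\theta)$ can be made arbitrarily close to zero for any fixed value of $\theta \in \Theta$.

Proof: See Appendix H. ■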

Proposition 5 shows that, in the considered setting, the refined lower bound on the error exponent provides the correct behavior of the error exponent for a binary hypothesis testing when the relative entropy between the pair of probability mass functions that characterize the two hypotheses tends to zero. This is in contrast to the loosened error exponent, which follows from Azuma's inequality, whose scaling may differ significantly from the correct exponent (for a concrete example, see the last part of the proof in Appendix H).

Example 5: Consider the indexed family of probability mass functions defined over the binary alphabet $\mathcal{X} = \{0,1\}$:

$$P_\theta(0) = 1-\theta, \qquad P_\theta(1) = \theta, \qquad \forall\, \theta \in (0,1).$$

From (142), the Fisher information is equal to

$$J(\theta) = \frac{1}{\theta} + \frac{1}{1-\theta}$$

and, at the point $\theta = 0.5$, $J(\theta) = 4$. Let $\theta_1 = 0.51$ and $\theta_2 = 0.49$, so from (144) and (146)

$$C(P_{\theta_1}, P_{\theta_2}),\ E_L(P_{\theta_1}, P_{\theta_2}) \approx \frac{J(\theta)\,(\theta_1-\theta_2)^2}{8} = 2.00 \cdot 10^{-4}.$$

Indeed, the exact values of $C(P_{\theta_1}, P_{\theta_2})$ and $E_L(P_{\theta_1}, P_{\theta_2})$ are $2.000 \cdot 10^{-4}$ and $1.997 \cdot 10^{-4}$, respectively.
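The approximation of the Chernoff information by $\frac{J(\theta)(\theta_1-\theta_2)^2}{8}$ in this example can be reproduced with a few lines. The sketch below computes the Fisher information of the Bernoulli family and the Chernoff information by a ternary search over $t \in [0,1]$; the function names are hypothetical, and the lower bound $E_L$ is not recomputed here.

```python
import math

def fisher_information_bernoulli(theta):
    """J(theta) = 1/theta + 1/(1-theta) for P_theta(1) = theta."""
    return 1.0 / theta + 1.0 / (1.0 - theta)

def chernoff_bernoulli(theta1, theta2, iters=200):
    """Chernoff information between Bernoulli(theta1) and Bernoulli(theta2), in nats."""
    def H(t):
        return math.log(theta1 ** (1 - t) * theta2 ** t
                        + (1 - theta1) ** (1 - t) * (1 - theta2) ** t)
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if H(m1) > H(m2):      # H is convex, so discard the side with the larger value
            lo = m1
        else:
            hi = m2
    return -H(0.5 * (lo + hi))

theta1, theta2 = 0.51, 0.49
approx = fisher_information_bernoulli(0.5) * (theta1 - theta2) ** 2 / 8
exact = chernoff_bernoulli(theta1, theta2)
print(approx, exact)
```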

5) Moderate Deviations Analysis for Binary Hypothesis Testing: So far, we have discussed large deviations 
analysis for binary hypothesis testing, and compared the exact error exponents with lower bounds that follow from 
refined versions of Azuma's inequality. 

Based on the asymptotic results in (90) and (91), which hold a.s. under hypotheses $H_1$ and $H_2$ respectively, the large deviations analysis refers to upper and lower thresholds $\overline{\lambda}$ and $\underline{\lambda}$ which are kept fixed (i.e., these thresholds do not depend on the block length $n$ of the data sequence) where

$$-D(P_2\|P_1) < \underline{\lambda} \le \overline{\lambda} < D(P_1\|P_2).$$

Suppose that instead of having some fixed upper and lower thresholds, one is interested in setting these thresholds such that as the block length $n$ tends to infinity, they tend to their asymptotic limits in (90) and (91), i.e.,

$$\lim_{n\to\infty}\overline{\lambda}^{(n)} = D(P_1\|P_2), \qquad \lim_{n\to\infty}\underline{\lambda}^{(n)} = -D(P_2\|P_1).$$

Specifically, let $\eta \in (\tfrac{1}{2}, 1)$, and $\varepsilon_1, \varepsilon_2 > 0$ be arbitrary fixed numbers, and consider the case where one decides on hypothesis $H_1$ if

$$L(X_1,\ldots,X_n) > n\overline{\lambda}^{(n)}$$

and on hypothesis $H_2$ if

$$L(X_1,\ldots,X_n) < n\underline{\lambda}^{(n)}$$

where these upper and lower thresholds are set to

$$\overline{\lambda}^{(n)} \triangleq D(P_1\|P_2) - \varepsilon_1\, n^{-(1-\eta)}$$

$$\underline{\lambda}^{(n)} \triangleq -D(P_2\|P_1) + \varepsilon_2\, n^{-(1-\eta)}$$

so that they approach, respectively, the relative entropies $D(P_1\|P_2)$ and $-D(P_2\|P_1)$ in the asymptotic case where the block length $n$ of the data sequence tends to infinity. Accordingly, the conditional probabilities in (92)-(95) are modified so that the fixed thresholds $\overline{\lambda}$ and $\underline{\lambda}$ are replaced with the above block-length dependent thresholds $\overline{\lambda}^{(n)}$ and $\underline{\lambda}^{(n)}$, respectively. The moderate deviations analysis for binary hypothesis testing studies the probability of an error event and also the probability of the event of either making an erroneous decision or making no decision (i.e., declaring an erasure) under the two hypotheses. Particularly, we also study the asymptotic scaling of these probabilities under either $H_1$ or $H_2$ when simultaneously the block length of the input sequence $n$ tends to infinity, and the thresholds $\overline{\lambda}^{(n)}$ and $\underline{\lambda}^{(n)}$ tend to $D(P_1\|P_2)$ and $-D(P_2\|P_1)$, respectively (which are the asymptotic limits in (90) and (91), respectively, when the block length tends to infinity).

Before proceeding to the moderate deviations analysis of binary hypothesis testing, the related literature in the
context of information-theoretic problems is briefly reviewed. The moderate deviations analysis in the context of
source and channel coding has recently attracted some interest among information theorists (see [1], [4], [32], [56] 
and [76]). 

Moderate deviations were analyzed in [1, Section 4.3] for a channel model that gets noisier as the block length 
is increased. Due to the dependence of the channel parameter on the block length, the usual notion of capacity
for these channels is zero. Hence, the issue of increasing the block length for the considered type of degrading 
channels was examined in [1, Section 4.3] via moderate deviations analysis when the number of codewords increases 
sub-exponentially with the block length. In another recent work [4], the moderate deviations behavior of channel 
coding for discrete memoryless channels was studied by Altug and Wagner with a derivation of direct and converse 
results which explicitly characterize the rate function of the moderate deviations principle (MDP). In [4], the authors 
studied the interplay between the probability of error, code rate and block length when the communication takes 
place over discrete memoryless channels, having the interest to figure out how the decoding error probability of the 
best code scales when simultaneously the block length tends to infinity and the code rate approaches the channel 
capacity. The novelty in the setup of their analysis was the consideration of the scenario mentioned above, in 
contrast to the case where the rate is kept fixed below capacity, and the study is reduced to a characterization of 
the dependence between the two remaining parameters (i.e., the block length n and the average/maximal error
probability of the best code). As opposed to the latter case when the code rate is kept fixed, which then corresponds 
to large deviations analysis and characterizes the error exponents as a function of the rate, the analysis in [4] (via 
the introduction of direct and converse theorems) demonstrated a sub-exponential scaling of the maximal error 
probability in the considered moderate deviations regime. This work was followed by a work by Polynaskiy and 
Verdti where they show that a DMC satisfies the MDP if and only if its channel dispersion is non-zero, and also 
that the AWGN channel satisfies the MDP with a constant that is equal to the channel dispersion. The approach 
used in [4] was based on the method of types, whereas the approach used in [57] borrowed some tools from a 
recent work by the same authors in [56]. 

In [32], the moderate deviations analysis of the Slepian-Wolf problem for lossless source coding was studied. 
More recently, moderate deviations analysis for lossy source coding of stationary memoryless sources was studied 
in [76]. 

These works, including the following discussion, indicate a recent interest in moderate deviations analysis in the 
context of information-theoretic problems. In the literature on probability theory, the moderate deviations analysis 
was extensively studied (see, e.g., [21, Section 3.7]), and in particular the MDP was studied in [20] for continuous- 
time martingales with bounded jumps. 

In light of the discussion in Section V-C on the MDP for i.i.d. RVs and its relation to the concentration inequalities 
in Section IV (see Appendix G), and also motivated by the recent works on moderate-deviations analysis for 
information-theoretic aspects, we consider in the following moderate deviations analysis for binary hypothesis 
testing. Our approach for this kind of analysis is different from [4] and [57], and it relies on concentration inequalities 
for martingales. The material in the following was presented in part in [67]. 
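In the following, we analyze the probability of a joint error and erasure event under hypothesis $H_1$, i.e., derive
an upper bound on $\alpha_n^{(1)}$ in (92). The same kind of analysis can be adapted easily for the other probabilities in (93)-(95).
As mentioned earlier, let $\varepsilon_1 > 0$ and $\eta \in (\tfrac{1}{2}, 1)$ be two arbitrarily fixed numbers. Then, under hypothesis $H_1$,
it follows, similarly to (116)-(118), that

$$
\begin{aligned}
P_1^n\bigl(L(X_1,\ldots,X_n) \le n\,\overline{\lambda}^{(n)}\bigr)
&= P_1^n\bigl(L(X_1,\ldots,X_n) \le nD(P_1\|P_2) - \varepsilon_1 n^{\eta}\bigr) \\
&\le \exp\left(-n\, D\left(\frac{\delta_1^{(\eta,n)}+\gamma_1}{1+\gamma_1} \,\Big\|\, \frac{\gamma_1}{1+\gamma_1}\right)\right)
\end{aligned}
\tag{149}
$$

where

$$
\delta_1^{(\eta,n)} \triangleq \frac{\varepsilon_1 n^{-(1-\eta)}}{d_1}, \qquad \gamma_1 \triangleq \frac{\sigma_1^2}{d_1^2}
\tag{150}
$$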

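with $d_1$ and $\sigma_1^2$ from (112) and (113). From (139), (140) and (150), it follows that

$$
\begin{aligned}
D\left(\frac{\delta_1^{(\eta,n)}+\gamma_1}{1+\gamma_1} \,\Big\|\, \frac{\gamma_1}{1+\gamma_1}\right)
&= \frac{\gamma_1}{1+\gamma_1}\left[\left(1+\frac{\delta_1^{(\eta,n)}}{\gamma_1}\right)\ln\left(1+\frac{\delta_1^{(\eta,n)}}{\gamma_1}\right) + \frac{\bigl(1-\delta_1^{(\eta,n)}\bigr)\ln\bigl(1-\delta_1^{(\eta,n)}\bigr)}{\gamma_1}\right] \\
&\ge \frac{\gamma_1}{1+\gamma_1}\left[\frac{\bigl(\delta_1^{(\eta,n)}\bigr)^2}{2\gamma_1}\left(\frac{1}{\gamma_1}+1\right) - \frac{\bigl(\delta_1^{(\eta,n)}\bigr)^3}{6\gamma_1^3}\right] \\
&= \frac{\bigl(\delta_1^{(\eta,n)}\bigr)^2}{2\gamma_1}\left(1 - \frac{\delta_1^{(\eta,n)}}{3\gamma_1(1+\gamma_1)}\right) \\
&= \frac{\varepsilon_1^2\, n^{-2(1-\eta)}}{2\sigma_1^2}\left(1 - \frac{\varepsilon_1 d_1}{3\sigma_1^2(1+\gamma_1)}\cdot\frac{1}{n^{1-\eta}}\right)
\end{aligned}
$$

provided that $\delta_1^{(\eta,n)} < 1$ (which holds for $n \ge n_0$ for some $n_0 = n_0(\eta, \varepsilon_1, d_1) \in \mathbb{N}$ that is determined from (150)).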

By substituting this lower bound on the divergence into (149), it follows that

$$
\alpha_n^{(1)} = P_1^n\bigl(L(X_1,\ldots,X_n) \le nD(P_1\|P_2) - \varepsilon_1 n^{\eta}\bigr)
\le \exp\left(-\frac{\varepsilon_1^2\, n^{2\eta-1}}{2\sigma_1^2}\left(1 - \frac{\varepsilon_1 d_1}{3\sigma_1^2(1+\gamma_1)}\cdot\frac{1}{n^{1-\eta}}\right)\right).
\tag{151}
$$

Consequently, in the limit where n tends to infinity,

$$
\lim_{n\to\infty} n^{1-2\eta} \ln \alpha_n^{(1)} \le -\frac{\varepsilon_1^2}{2\sigma_1^2}
\tag{152}
$$

with $\sigma_1^2$ in (113). From the analysis in Section V-C and Appendix G, the following things hold:

• The inequality for the asymptotic limit in (152) holds in fact with equality.

• The same asymptotic result also follows from Theorem 4 for every even-valued $m \ge 2$ (instead of Theorem 3).

To verify these statements, consider the real-valued sequence of i.i.d. RVs

$$
Y_i \triangleq \ln\left(\frac{P_1(X_i)}{P_2(X_i)}\right) - D(P_1\|P_2), \qquad i = 1,\ldots,n
$$

that, under hypothesis $H_1$, have zero mean and variance $\sigma_1^2$. Since, by assumption, $\{X_i\}_{i=1}^n$ are i.i.d., then

$$
L(X_1,\ldots,X_n) - nD(P_1\|P_2) = \sum_{i=1}^n Y_i
\tag{153}
$$

and it follows from the one-sided version of the MDP in (82) that indeed (152) holds with equality. Moreover,
Theorem 3 provides, via the inequality in (151), a finite-length result that enhances the asymptotic result for $n \to \infty$.
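
The finite-length bound and its asymptotic behavior can be checked numerically. The following is a minimal sketch, not part of the original analysis: it evaluates the exact exponent of (149) and the closed-form exponent of (151) for hypothetical values of $\eta$, $\varepsilon_1$, $d_1$ and $\sigma_1^2$ (chosen only for illustration, with $\sigma_1 \le d_1$), and prints the normalized exponent that should approach $\varepsilon_1^2/(2\sigma_1^2)$ as n grows. The function names are illustrative.

```python
import math

def binary_kl(p, q):
    """Divergence D(p||q) in nats between Bernoulli(p) and Bernoulli(q), 0 < q < 1."""
    total = 0.0
    if p > 0.0:
        total += p * math.log(p / q)
    if p < 1.0:
        total += (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    return total

def exponent_exact(n, eta, eps1, d1, sigma1_sq):
    """Exponent n * D((delta+gamma)/(1+gamma) || gamma/(1+gamma)) appearing in (149)."""
    delta = eps1 * n ** (-(1.0 - eta)) / d1   # delta_1^{(eta,n)} in (150)
    gamma = sigma1_sq / d1 ** 2               # gamma_1 in (150)
    if delta >= 1.0:
        raise ValueError("delta must be below 1; increase n")
    return n * binary_kl((delta + gamma) / (1.0 + gamma), gamma / (1.0 + gamma))

def exponent_151(n, eta, eps1, d1, sigma1_sq):
    """Closed-form exponent of (151); it lower-bounds exponent_exact."""
    gamma = sigma1_sq / d1 ** 2
    correction = eps1 * d1 / (3.0 * sigma1_sq * (1.0 + gamma)) / n ** (1.0 - eta)
    return eps1 ** 2 * n ** (2.0 * eta - 1.0) / (2.0 * sigma1_sq) * (1.0 - correction)

# Hypothetical parameter values, used only to illustrate the scaling.
ETA, EPS1, D1, SIGMA1_SQ = 0.75, 1.0, 2.0, 0.5

for n in (10**2, 10**4, 10**6):
    e_exact = exponent_exact(n, ETA, EPS1, D1, SIGMA1_SQ)
    e_151 = exponent_151(n, ETA, EPS1, D1, SIGMA1_SQ)
    # The last column tends to eps1^2 / (2 sigma1^2) as n grows, in line with (152).
    print(n, e_exact, e_151, n ** (1.0 - 2.0 * ETA) * e_151)
```

Since the bound is $\exp(-\text{exponent})$, a larger exponent means a tighter bound; the exact divergence in (149) is therefore never worse than the closed form in (151).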
The second item above follows from the second part of the analysis in Appendix G (i.e., the part of the analysis in
this appendix that follows from Theorem 4).

In the considered setting of moderate deviations analysis for binary hypothesis testing, the upper bound on the
probability $\alpha_n^{(1)}$ in (151), which refers to the probability of either making an error or declaring an erasure (i.e.,
making no decision) under the hypothesis $H_1$, decays to zero sub-exponentially with the length n of the sequence.
As mentioned above, based on the analysis in Section V-C and Appendix G, the asymptotic upper bound in (152) is
tight. A completely similar moderate-deviations analysis can also be performed under the hypothesis $H_2$. Hence, a
sub-exponential scaling of the probability $\beta_n^{(1)}$ in (94) of either making an error or declaring an erasure (where the
lower threshold $\underline{\lambda}$ is replaced with $\underline{\lambda}^{(n)}$) also holds under the hypothesis $H_2$. These two sub-exponential decays to
zero for the probabilities $\alpha_n^{(1)}$ and $\beta_n^{(1)}$, under hypothesis $H_1$ or $H_2$ respectively, improve as the value of $\eta \in (\tfrac{1}{2}, 1)$
is increased. On the other hand, the two exponential decays to zero of the probabilities of error (i.e., $\alpha_n^{(2)}$ and $\beta_n^{(2)}$
under hypothesis $H_1$ or $H_2$, respectively) improve as the value of $\eta \in (\tfrac{1}{2}, 1)$ is decreased; this is due to the fact
that, for a fixed value of n, the margin which serves to protect us from making an error (either under hypothesis
$H_1$ or $H_2$) is increased by decreasing the value of $\eta$ as above (note that by reducing the value of $\eta$ for a fixed n,
the upper and lower thresholds $\overline{\lambda}^{(n)}$ and $\underline{\lambda}^{(n)}$ are made closer to $D(P_1\|P_2)$ from below and to $-D(P_2\|P_1)$ from
above, respectively, which therefore increases the margin that is used for protecting one from making an erroneous
decision). This shows the existence of a tradeoff, in the choice of the parameter $\eta \in (\tfrac{1}{2}, 1)$, between the probability
of error and the joint probability of error and erasure under either hypothesis $H_1$ or $H_2$ (where this tradeoff exists
symmetrically for each of the two hypotheses).

In [4] and [57], the authors consider moderate deviations analysis for channel coding over memoryless channels. In
particular, [4, Theorem 2.2] and [57, Theorem 6] indicate a tight lower bound (i.e., a converse) to the asymptotic
result in (152) for binary hypothesis testing. This tight converse is indeed consistent with the asymptotic result of
the MDP in (82) for real-valued i.i.d. random variables, which implies that the asymptotic upper bound in (152),
obtained via the martingale approach with the refined version of Azuma's inequality in Theorem 3, holds indeed
with equality. Note that this equality does not follow from Azuma's inequality, so its refinement was essential for
obtaining this equality. The reason is that, due to Appendix G, the upper bound in (152) that is equal to $-\frac{\varepsilon_1^2}{2\sigma_1^2}$ is
replaced via Azuma's inequality by the looser bound $-\frac{\varepsilon_1^2}{2d_1^2}$ (note that, from (112) and (113), $\sigma_1 \le d_1$ where $\sigma_1$
may be significantly smaller than $d_1$).

6) Second-Order Analysis for Binary Hypothesis Testing: The moderate deviations analysis in the previous sub-section
refers to deviations that scale like $n^{\eta}$ for $\eta \in (\tfrac{1}{2}, 1)$. Let us consider now the case of $\eta = \tfrac{1}{2}$, which
corresponds to small deviations. To this end, refer to the real-valued sequence of i.i.d. RVs $\{Y_i\}_{i=1}^n$ with zero
mean and variance $\sigma_1^2$ (under hypothesis $H_1$), and define the partial sums $S_k = \sum_{i=1}^k Y_i$ for $k \in \{1, \ldots, n\}$ with
$S_0 = 0$. This implies that $\{S_k, \mathcal{F}_k\}_{k=0}^n$ is a martingale sequence. This links the current discussion on
binary hypothesis testing to Section V-A, which refers to the relation between the martingale CLT and Proposition 4.
Specifically, since from (153),

$$
S_n - S_0 = L(X_1,\ldots,X_n) - nD(P_1\|P_2)
$$

then from the proof of Proposition 4, one gets an upper bound on the probability

$$
P_1^n\bigl(L(X_1,\ldots,X_n) \le nD(P_1\|P_2) - \varepsilon_1\sqrt{n}\bigr)
$$

for a finite block length n (via an analysis that is related to either Theorem 3 or Theorem 4) which agrees with the asymptotic
result

$$
\lim_{n\to\infty} \ln P_1^n\bigl(L(X_1,\ldots,X_n) \le nD(P_1\|P_2) - \varepsilon_1\sqrt{n}\bigr) = -\frac{\varepsilon_1^2}{2\sigma_1^2}.
$$

In reference to small deviations analysis and the CLT, this shows a duality between these kinds of results and recent
works on second-order analysis for channel coding (see [33], [56], and [58], where the variance $\sigma_1^2$ in (113) is
replaced with the channel dispersion that is defined to be the variance of the mutual information RV between the
channel input and output, and is a property of the communication channel solely).
B. Pairwise Error Probability for Linear Block Codes over Binary-Input Output-Symmetric DMCs

In this sub-section, the tightness of Theorems 3 and 4 is studied by the derivation of upper bounds on the
pairwise error probability under maximum-likelihood (ML) decoding when the transmission takes place over a
discrete memoryless channel (DMC).

Let $\mathcal{C}$ be a binary linear block code of block length n, and assume that the codewords are a-priori equi-probable.
Consider the case where the communication takes place over a binary-input output-symmetric DMC whose input
alphabet is $\mathcal{X} = \{0, 1\}$, and its output alphabet $\mathcal{Y}$ is finite.

In the following, boldface letters denote vectors, regular letters with sub-scripts denote individual elements of
vectors, capital letters represent RVs, and lower-case letters denote individual realizations of the corresponding RVs.
Let

$$
P_{\mathbf{Y}|\mathbf{X}}(\mathbf{y}|\mathbf{x}) = \prod_{i=1}^n P_{Y|X}(y_i|x_i)
$$

be the transition probability of the DMC, where due to the symmetry assumption

$$
P_{Y|X}(y|0) = P_{Y|X}(-y|1), \qquad \forall\, y \in \mathcal{Y}.
$$

It is also assumed in the following that $P_{Y|X}(y|x) > 0$ for every $(x, y) \in \mathcal{X} \times \mathcal{Y}$. Due to the linearity of the code
and the symmetry of the DMC, the decoding error probability is independent of the transmitted codeword, so it is
assumed without any loss of generality that the all-zero codeword is transmitted. In the following, we consider the
pairwise error probability when the competitive codeword $\mathbf{x} \in \mathcal{C}$ has a Hamming weight that is equal to h, and
denote it by $W_H(\mathbf{x}) = h$. Let $P_{\mathbf{Y}}$ denote the probability distribution of the channel output.

In order to derive upper bounds on the pairwise error probability, let us define the following two hypotheses:

• $H_1: \; P_{\mathbf{Y}}(\mathbf{y}) = \prod_{i=1}^n P_{Y|X}(y_i|0), \quad \forall\, \mathbf{y} \in \mathcal{Y}^n$,

• $H_2: \; P_{\mathbf{Y}}(\mathbf{y}) = \prod_{i=1}^n P_{Y|X}(y_i|x_i), \quad \forall\, \mathbf{y} \in \mathcal{Y}^n$

which correspond, respectively, to the transmission of the all-zero codeword and the competitive codeword $\mathbf{x} \in \mathcal{C}$.
Under hypothesis $H_1$, the considered pairwise error event under ML decoding occurs if and only if

$$
\sum_{i=1}^n \ln\left(\frac{P_{Y|X}(y_i|x_i)}{P_{Y|X}(y_i|0)}\right) \ge 0.
$$

Let $\{i_k\}_{k=1}^h$ be the h indices of the coordinates of $\mathbf{x}$ where $x_i = 1$, ordered such that $1 \le i_1 < \ldots < i_h \le n$.
Based on this notation, the log-likelihood ratio satisfies the equality

$$
\sum_{i=1}^n \ln\left(\frac{P_{Y|X}(y_i|x_i)}{P_{Y|X}(y_i|0)}\right) = \sum_{m=1}^h \ln\left(\frac{P_{Y|X}(y_{i_m}|1)}{P_{Y|X}(y_{i_m}|0)}\right).
\tag{154}
$$

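For the continuation of the analysis in this sub-section, let us define the martingale sequence $\{U_k, \mathcal{F}_k\}_{k=0}^h$ with
the filtration

$$
\mathcal{F}_k = \sigma(Y_{i_1}, \ldots, Y_{i_k}), \qquad k = 1, \ldots, h
$$

and, under hypothesis $H_1$, let

$$
U_k = \mathbb{E}\left[\sum_{m=1}^h \ln\left(\frac{P_{Y|X}(Y_{i_m}|1)}{P_{Y|X}(Y_{i_m}|0)}\right) \,\Big|\, \mathcal{F}_k\right], \qquad \forall\, k \in \{0, 1, \ldots, h\}.
$$

Since, under hypothesis $H_1$, the RVs $Y_{i_1}, \ldots, Y_{i_h}$ are statistically independent, then for $k \in \{0, 1, \ldots, h\}$

$$
\begin{aligned}
U_k &= \sum_{m=1}^k \ln\left(\frac{P_{Y|X}(Y_{i_m}|1)}{P_{Y|X}(Y_{i_m}|0)}\right) + (h-k) \sum_{y \in \mathcal{Y}} P_{Y|X}(y|0) \ln\left(\frac{P_{Y|X}(y|1)}{P_{Y|X}(y|0)}\right) \\
&= \sum_{m=1}^k \ln\left(\frac{P_{Y|X}(Y_{i_m}|1)}{P_{Y|X}(Y_{i_m}|0)}\right) - (h-k)\, D\bigl(P_{Y|X}(\cdot|0)\,\|\,P_{Y|X}(\cdot|1)\bigr).
\end{aligned}
\tag{155}
$$

Specifically

$$
U_0 = -h\, D\bigl(P_{Y|X}(\cdot|0)\,\|\,P_{Y|X}(\cdot|1)\bigr)
\tag{156}
$$

$$
U_h = \sum_{i=1}^n \ln\left(\frac{P_{Y|X}(Y_i|x_i)}{P_{Y|X}(Y_i|0)}\right)
\tag{157}
$$

where the last equality follows from (154) and (155), and the differences of the martingale sequence are given by

$$
\xi_k \triangleq U_k - U_{k-1} = \ln\left(\frac{P_{Y|X}(Y_{i_k}|1)}{P_{Y|X}(Y_{i_k}|0)}\right) + D\bigl(P_{Y|X}(\cdot|0)\,\|\,P_{Y|X}(\cdot|1)\bigr)
\tag{158}
$$

for every $k \in \{1, \ldots, h\}$. Note that, under hypothesis $H_1$, indeed $\mathbb{E}[\xi_k | \mathcal{F}_{k-1}] = 0$.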

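The probability of a pairwise error event, where the ML decoder prefers a competitive codeword $\mathbf{x} \in \mathcal{C}$ ($W_H(\mathbf{x}) = h$)
over the transmitted all-zero codeword, is equal to

$$
\begin{aligned}
P_h &\triangleq \mathbb{P}(U_h \ge 0 \,|\, H_1) \\
&= \mathbb{P}\Bigl(U_h - U_0 \ge h\, D\bigl(P_{Y|X}(\cdot|0)\,\|\,P_{Y|X}(\cdot|1)\bigr) \,\Big|\, H_1\Bigr).
\end{aligned}
\tag{159}
$$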

It therefore follows that a.s. for every $k \in \{1, \ldots, h\}$

$$
|\xi_k| \le \max_{y \in \mathcal{Y}} \left|\ln\left(\frac{P_{Y|X}(y|1)}{P_{Y|X}(y|0)}\right)\right| + D\bigl(P_{Y|X}(\cdot|0)\,\|\,P_{Y|X}(\cdot|1)\bigr) \triangleq d < \infty
\tag{160}
$$

which is indeed finite since, by assumption, the alphabet $\mathcal{Y}$ is finite and $P_{Y|X}(y|x) > 0$ for every $(x, y) \in \mathcal{X} \times \mathcal{Y}$.
Note that, in fact, taking an absolute value in the maximization of the logarithm on the right-hand side of (160) is
redundant due to the channel symmetry, and also due to the equality $\sum_y P_{Y|X}(y|0) = \sum_y P_{Y|X}(y|1) = 1$ (so that
it follows, from this equality, that there exists an element $y \in \mathcal{Y}$ such that $P_{Y|X}(y|1) \ge P_{Y|X}(y|0)$).

As an interim conclusion, $\{U_k, \mathcal{F}_k\}_{k=0}^h$ is a martingale sequence with bounded jumps, and $|U_k - U_{k-1}| \le d$
holds a.s. for every $k \in \{1, \ldots, h\}$. We rely in the following on the concentration inequalities of Theorems 3 and 4
to obtain, via (158)-(160), upper bounds on the pairwise error probability. The tightness of these bounds will be
examined numerically, and they will be compared to the Bhattacharyya upper bound.

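1) Analysis Related to Theorem 3: From (158), for every $k \in \{1, \ldots, h\}$

$$
\begin{aligned}
\mathbb{E}[\xi_k^2 \,|\, \mathcal{F}_{k-1}]
&= \sum_{y \in \mathcal{Y}} P_{Y|X}(y|0) \left[\ln\left(\frac{P_{Y|X}(y|1)}{P_{Y|X}(y|0)}\right) + D\bigl(P_{Y|X}(\cdot|0)\,\|\,P_{Y|X}(\cdot|1)\bigr)\right]^2 \\
&= \sum_{y \in \mathcal{Y}} P_{Y|X}(y|0) \left[\ln\left(\frac{P_{Y|X}(y|1)}{P_{Y|X}(y|0)}\right)\right]^2 - \Bigl[D\bigl(P_{Y|X}(\cdot|0)\,\|\,P_{Y|X}(\cdot|1)\bigr)\Bigr]^2 \\
&\triangleq \sigma^2
\end{aligned}
\tag{161}
$$

holds a.s., where the last equality follows from the definition of the divergence (relative entropy). Based on (159)
and the notation in (29), let

$$
\gamma \triangleq \frac{\sigma^2}{d^2}, \qquad \delta \triangleq \frac{D\bigl(P_{Y|X}(\cdot|0)\,\|\,P_{Y|X}(\cdot|1)\bigr)}{d}
\tag{162}
$$

where d and $\sigma^2$ are introduced in (160) and (161), respectively. Under hypothesis $H_1$, one gets from (159) and
Theorem 3 that the pairwise error probability satisfies the upper bound

$$
P_h \le Z_1^h
\tag{163}
$$

where

$$
Z_1 \triangleq \exp\left(-D\left(\frac{\delta+\gamma}{1+\gamma} \,\Big\|\, \frac{\gamma}{1+\gamma}\right)\right)
\tag{164}
$$

and $\gamma$, $\delta$ are introduced in (162).

In the following, we compare the exponential bound in (163) with the Bhattacharyya bound

$$
P_h \le Z_B^h
\tag{165}
$$

where the Bhattacharyya parameter $Z_B$ of the binary-input DMC is given by

$$
Z_B \triangleq \sum_{y \in \mathcal{Y}} \sqrt{P_{Y|X}(y|0)\, P_{Y|X}(y|1)}.
\tag{166}
$$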

Example 6: Consider a binary symmetric channel (BSC) with crossover probability p. The Bhattacharyya parameter
which corresponds to this channel is $Z_B = \sqrt{4p(1-p)}$. In the following, $Z_1$ from (164) is calculated for
comparison. Without loss of generality, assume that $p \le \tfrac{1}{2}$. Straightforward calculation shows that

$$
\begin{aligned}
d &= 2(1-p) \ln\left(\frac{1-p}{p}\right) \\
\sigma^2 &= 4p(1-p) \left[\ln\left(\frac{1-p}{p}\right)\right]^2 \\
D\bigl(P_{Y|X}(\cdot|0)\,\|\,P_{Y|X}(\cdot|1)\bigr) &= (1-2p) \ln\left(\frac{1-p}{p}\right)
\end{aligned}
$$

and therefore (162) gives that

$$
\gamma = \frac{p}{1-p}, \qquad \delta = \frac{1-2p}{2(1-p)}.
$$

Substituting $\gamma$ and $\delta$ into (164) gives that the base of the exponential bound in (163) is equal to

$$
Z_1 = \exp\left(-D\left(\tfrac{1}{2} \,\Big\|\, p\right)\right) = \sqrt{4p(1-p)}
$$

which coincides with the Bhattacharyya parameter for the BSC. This shows that, for the BSC, Theorem 3 implies
the Bhattacharyya upper bound on the pairwise error probability.

In general, it is observed numerically that $Z_1 \ge Z_B$ for binary-input output-symmetric DMCs, with equality
for the BSC (this will be exemplified after introducing the bound on the pairwise error probability which follows
from Theorem 4). This implies that Theorem 3 yields in general a looser bound than the Bhattacharyya upper
bound in the context of the pairwise error probability for DMCs.
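
The comparison between $Z_1$ and $Z_B$ is easy to reproduce. The sketch below is an illustration rather than the code behind the reported numbers: it computes d, $\sigma^2$, $\gamma$ and $\delta$ from (160)-(162) for a binary-input DMC given as two lists of strictly positive transition probabilities, and then evaluates (164) and (166). The function name `pairwise_bases` and the list-based channel representation are assumptions made for this example.

```python
import math

def pairwise_bases(p0, p1):
    """Return (Z_1, Z_B) for a binary-input DMC.

    p0[y] = P(y|0) and p1[y] = P(y|1) must be strictly positive and each sum to 1.
    """
    # Divergence D(P(.|0) || P(.|1)) in nats
    div = sum(a * math.log(a / b) for a, b in zip(p0, p1))
    # Bound d on the martingale jumps, as in (160)
    d = max(abs(math.log(b / a)) for a, b in zip(p0, p1)) + div
    # Conditional variance sigma^2 of the jumps, as in (161)
    var = sum(a * math.log(b / a) ** 2 for a, b in zip(p0, p1)) - div ** 2
    gamma, delta = var / d ** 2, div / d                      # (162)
    p, q = (delta + gamma) / (1 + gamma), gamma / (1 + gamma)
    kl = p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))
    z1 = math.exp(-kl)                                        # (164)
    zb = sum(math.sqrt(a * b) for a, b in zip(p0, p1))        # (166)
    return z1, zb

# BSC with crossover probability 0.04: the two bases coincide (Example 6).
crossover = 0.04
z1_bsc, zb_bsc = pairwise_bases([1 - crossover, crossover], [crossover, 1 - crossover])
print(z1_bsc, zb_bsc)

# A ternary-output symmetric channel: here Z_1 is strictly larger than Z_B.
z1_q3, zb_q3 = pairwise_bases([0.92, 0.04, 0.04], [0.04, 0.04, 0.92])
print(z1_q3, zb_q3)
```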

2) Analysis Related to Theorem 4: In the following, a parallel upper bound on the pairwise error probability is
derived from Remark 12 on Theorem 4, and the martingale sequence $\{U_k, \mathcal{F}_k\}_{k=0}^h$. Under hypothesis $H_1$ (i.e., the
assumption that the all-zero codeword is transmitted), (158) implies that the conditional expectation of $(U_k - U_{k-1})^l$
given $\mathcal{F}_{k-1}$ is equal (a.s.) to the unconditional expectation, where l is an arbitrary natural number. Also, it follows
from (158) that for every $k \in \{1, \ldots, h\}$ and $l \in \mathbb{N}$

$$
\mathbb{E}\bigl[(U_k - U_{k-1})^l \,|\, \mathcal{F}_{k-1}\bigr] = (-1)^l\, \mathbb{E}\left[\left(\ln\left(\frac{P_{Y|X}(Y|0)}{P_{Y|X}(Y|1)}\right) - D\bigl(P_{Y|X}(\cdot|0)\,\|\,P_{Y|X}(\cdot|1)\bigr)\right)^l\right]
$$
TABLE I

The bases of the exponential bounds $Z_1$ and $Z_2^{(m)}$ in (163) and (168) (for an even-valued $m \ge 2$), respectively. The
bases of these exponential bounds are compared to the Bhattacharyya parameter $Z_B$ in (166) for the five DMCs
in (169) with $p = 0.04$ and $|\mathcal{Y}| = Q = 2, 3, 4, 5, 10$.

| Q            | 2      | 3      | 4      | 5      | 10     |
|--------------|--------|--------|--------|--------|--------|
| $Z_B$        | 0.3919 | 0.4237 | 0.4552 | 0.4866 | 0.6400 |
| $Z_1$        | 0.3919 | 0.4424 | 0.4879 | 0.5297 | 0.7012 |
| $Z_2^{(2)}$  | 0.3967 | 0.4484 | 0.4950 | 0.5377 | 0.7102 |
| $Z_2^{(4)}$  | 0.3919 | 0.4247 | 0.4570 | 0.4887 | 0.6421 |
| $Z_2^{(6)}$  | 0.3919 | 0.4237 | 0.4553 | 0.4867 | 0.6400 |
| $Z_2^{(8)}$  | 0.3919 | 0.4237 | 0.4552 | 0.4866 | 0.6400 |
| $Z_2^{(10)}$ | 0.3919 | 0.4237 | 0.4552 | 0.4866 | 0.6400 |

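and, based on Remark 12, let

$$
\mu_l \triangleq \max\left\{0,\; (-1)^l\, \mathbb{E}\left[\left(\ln\left(\frac{P_{Y|X}(Y|0)}{P_{Y|X}(Y|1)}\right) - D\bigl(P_{Y|X}(\cdot|0)\,\|\,P_{Y|X}(\cdot|1)\bigr)\right)^l\right]\right\}
\tag{167}
$$

for every $l \in \mathbb{N}$ (for even-valued l, there is no need to take the maximization with zero). Based on the notation
used in the context of Remark 12, let

$$
\gamma_l \triangleq \frac{\mu_l}{d^l}, \qquad l = 2, 3, \ldots
$$

and $\delta$ be the same parameter as in (162). Note that the equality $\gamma_2 = \gamma$ holds for the parameter $\gamma$ in (162). Then,
Remark 12 on Theorem 4 yields that for every even-valued $m \ge 2$

$$
P_h \le \bigl(Z_2^{(m)}\bigr)^h
\tag{168}
$$

where

$$
Z_2^{(m)} \triangleq \inf_{x \ge 0} \left\{ e^{-\delta x} \left[ 1 + \sum_{l=2}^{m-1} \frac{(\gamma_l - \gamma_m)\, x^l}{l!} + \gamma_m \bigl(e^x - 1 - x\bigr) \right] \right\}.
$$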

Example 7: In the following example, the bases of the two exponential bounds on the pairwise error probability
in (163) and (168) are compared to the corresponding Bhattacharyya parameter (see (166)) for some binary-input
output-symmetric DMCs.

For an integer-valued $Q \ge 2$, let $P_{Y|X}^{(Q)}$ be a binary-input output-symmetric DMC with input alphabet $\mathcal{X} = \{0, 1\}$
and output alphabet $\mathcal{Y} = \{0, 1, \ldots, Q-1\}$, characterized by the following transition probabilities:

$$
\begin{aligned}
&P_{Y|X}^{(Q)}(0|0) = P_{Y|X}^{(Q)}(Q-1|1) = 1 - (Q-1)p, \\
&P_{Y|X}^{(Q)}(1|0) = \ldots = P_{Y|X}^{(Q)}(Q-1|0) = p, \\
&P_{Y|X}^{(Q)}(0|1) = \ldots = P_{Y|X}^{(Q)}(Q-2|1) = p
\end{aligned}
\tag{169}
$$

where $0 < p < \frac{1}{Q-1}$. The considered exponential bounds are exemplified in the following for the case where
$p = 0.04$ and $Q = 2, 3, 4, 5, 10$. The bases of the exponential bounds in (163) and (168) are compared in Table I
to the corresponding Bhattacharyya parameters of these five DMCs which, from (166), are equal to

$$
Z_B = 2\sqrt{p\,[1 - (Q-1)p]} + (Q-2)p.
$$

As is shown in Table I, the choice of m = 2 gives the worst upper bound in Theorem 4 (since $Z_2^{(2)} \ge Z_2^{(m)}$
for every even-valued $m \ge 2$). This is consistent with Corollary 3. Moreover, the comparison of the third and
fourth lines in Table I is consistent with Proposition 2, which indeed assures that Theorem 4 with m = 2 is
looser than Theorem 3 (hence, indeed $Z_1 \le Z_2^{(2)}$ for the considered DMCs). Also, from Example 6, it follows
that Theorem 3 coincides with the Bhattacharyya bound for the BSC (hence, $Z_1 = Z_B$ for the special case where Q = 2, as is
indeed verified numerically in Table I). It is interesting to realize from Table I that for the five considered DMCs,
the sequence $\{Z_2^{(2)}, Z_2^{(4)}, Z_2^{(6)}, \ldots\}$ converges very fast, and the limit is equal to the Bhattacharyya parameter for
all the examined cases. This stands in contrast to the exponential base $Z_1$ that was derived from Theorem 3, and
which appears to be strictly larger than the corresponding Bhattacharyya parameter of the DMC (except for the
BSC, where the equality $Z_1 = Z_B$ holds, as is shown in Example 6).

TABLE II

The base $Z_2^{(m)}$ of the exponential bound in (168) and its (tight) upper bound $\widetilde{Z}_2^{(m)}$ that follows by replacing the
infimum operation by the sub-optimal value in (64) and (65). The five DMCs are the same as in (169) and Table I.

| Q                       | 2      | 3      | 4      | 5      | 10     |
|-------------------------|--------|--------|--------|--------|--------|
| $Z_2^{(m)}$             | 0.3919 | 0.4237 | 0.4552 | 0.4866 | 0.6400 |
| $\widetilde{Z}_2^{(m)}$ | 0.3919 | 0.4237 | 0.4553 | 0.4868 | 0.6417 |

Example 7 leads to the following conjecture:

Conjecture 1: For the martingale sequence $\{U_k, \mathcal{F}_k\}_{k=0}^h$ introduced in this sub-section,

$$
\lim_{m \to \infty} Z_2^{(m)} = Z_B
$$

and this convergence is quadratic.

Example 8: The base $Z_2^{(m)}$ of the exponential bound in (168) involves an operation of taking an infimum over
the interval $[0, \infty)$. This operation is performed numerically in general, except for the special case where m = 2
for which a closed-form solution exists (see Appendix B for the proof of Corollary 4).

Replacing the infimum over $x \in [0, \infty)$ with the sub-optimal value of x in (64) and (65) gives an upper bound
on the respective exponential base of the bound (note that due to the analysis, this sub-optimal value turns out to be
optimal in the special case where m = 2). The upper bound on $Z_2^{(m)}$ which follows by replacing the infimum with
the sub-optimal value in (64) and (65) is denoted by $\widetilde{Z}_2^{(m)}$, and the difference between the two values is marginal
(see Table II).
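
A numerical evaluation of the infimum can be sketched as follows. This is only an illustration under stated assumptions: the infimum is approximated by a plain grid search over a finite interval (the search range `x_max` and the number of grid points are arbitrary choices, not values taken from the text), the moments follow (167), and the objective is the expression inside the infimum that defines the base of the bound in (168). The helper `channel_169` builds the channel family of (169); all function names are hypothetical.

```python
import math

def channel_169(Q, p):
    """Transition probabilities (P(.|0), P(.|1)) of the Q-ary output DMC in (169)."""
    big = 1 - (Q - 1) * p
    return [big] + [p] * (Q - 1), [p] * (Q - 1) + [big]

def z2_base(p0, p1, m, x_max=12.0, steps=24000):
    """Grid-search approximation (from above) of the base Z_2^(m), for even m >= 2."""
    div = sum(a * math.log(a / b) for a, b in zip(p0, p1))
    d = max(abs(math.log(b / a)) for a, b in zip(p0, p1)) + div
    delta = div / d
    # gamma_l = mu_l / d^l, where mu_l = max(0, E[xi^l]) and
    # xi = ln(P(Y|1)/P(Y|0)) + D is the martingale jump under hypothesis H_1.
    gam = {
        l: max(0.0, sum(a * (math.log(b / a) + div) ** l for a, b in zip(p0, p1))) / d ** l
        for l in range(2, m + 1)
    }
    best = 1.0  # value of the objective at x = 0
    for i in range(1, steps + 1):
        x = x_max * i / steps
        bracket = 1.0 + gam[m] * (math.exp(x) - 1.0 - x)
        bracket += sum((gam[l] - gam[m]) * x ** l / math.factorial(l) for l in range(2, m))
        best = min(best, math.exp(-delta * x) * bracket)
    return best

# Bases for the BSC (Q = 2) with p = 0.04 and a few even values of m.
p0_bsc, p1_bsc = channel_169(2, 0.04)
for m in (2, 4, 6):
    print(m, z2_base(p0_bsc, p1_bsc, m))
```

A grid search was chosen here only because it needs no assumption on the shape of the objective; a one-dimensional optimizer could replace it if the objective is known to be unimodal on the search interval.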



C. Minimum Distance of Binary Linear Block Codes

Consider the ensemble of binary linear block codes of length n and rate R. The average value of the normalized
minimum distance is equal to

$$
\frac{\mathbb{E}[d_{\min}(\mathcal{C})]}{n} = h_2^{-1}(1-R)
$$

where $h_2^{-1}$ designates the inverse of the binary entropy function to the base 2, and the expectation is with respect
to the ensemble where the codes are chosen uniformly at random (see [8]).

Let H designate an $n(1-R) \times n$ parity-check matrix of a linear block code $\mathcal{C}$ from this ensemble. The minimum
distance of the code is equal to the minimal number of columns in H that are linearly dependent. Note that the
minimum distance is a property of the code, and it does not depend on the choice of the particular parity-check
matrix which represents the code.

Let us construct a martingale sequence $X_0, \ldots, X_n$ where $X_i$ (for $i = 0, 1, \ldots, n$) is a RV that denotes the
conditional expectation of the minimal number of linearly dependent columns of a parity-check matrix that is chosen
uniformly at random from the ensemble, given that we already revealed its first i columns. Based on Remarks 2 and 3,
this sequence indeed forms a martingale sequence where the associated filtration of the $\sigma$-algebras
$\mathcal{F}_0 \subseteq \mathcal{F}_1 \subseteq \ldots \subseteq \mathcal{F}_n$ is defined so
that $\mathcal{F}_i$ (for $i = 0, 1, \ldots, n$) is the $\sigma$-algebra that is generated by all the sub-sets of $n(1-R) \times n$ binary parity-check
matrices whose first i columns are fixed. This martingale sequence satisfies $|X_i - X_{i-1}| \le 1$ for $i = 1, \ldots, n$ (since
if we reveal a new column of H, then the minimal number of linearly dependent columns can change by at most 1).
Note that the RV $X_0$ is the expected minimum Hamming distance of the ensemble, and $X_n$ is the minimum distance
of a particular code from the ensemble (since once we revealed all the n columns of H, then the code is known
exactly). Hence, by Azuma's inequality

$$
\mathbb{P}\bigl(|d_{\min}(\mathcal{C}) - \mathbb{E}[d_{\min}(\mathcal{C})]| \ge \alpha\sqrt{n}\bigr) \le 2 \exp\left(-\frac{\alpha^2}{2}\right), \qquad \forall\, \alpha > 0.
$$
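
To get a feel for the numbers, the concentration window implied by this inequality can be evaluated directly. The sketch below assumes, as stated above, that the ensemble mean is $n\,h_2^{-1}(1-R)$; it inverts the binary entropy function by bisection on $[0, \tfrac{1}{2}]$ and returns the window together with the guaranteed probability $1 - 2\exp(-\alpha^2/2)$. The function names and the example parameters are illustrative only.

```python
import math

def h2(x):
    """Binary entropy function to the base 2."""
    if x in (0.0, 1.0):
        return 0.0
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

def h2_inverse(y, tol=1e-12):
    """Inverse of h2 restricted to [0, 1/2], computed by bisection (h2 is increasing there)."""
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if h2(mid) < y:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def min_distance_window(n, rate, alpha):
    """Window n*h2^{-1}(1-R) -/+ alpha*sqrt(n) and the Azuma guarantee 1 - 2*exp(-alpha^2/2)."""
    center = n * h2_inverse(1 - rate)
    half_width = alpha * math.sqrt(n)
    confidence = 1 - 2 * math.exp(-alpha ** 2 / 2)
    return center - half_width, center + half_width, confidence

# Illustrative values: block length 10000, rate 1/2, alpha = 3.
low, high, conf_level = min_distance_window(10000, 0.5, 3.0)
print(low, high, conf_level)
```

The window has width $2\alpha\sqrt{n}$ while its center grows linearly with n, which is the sense in which the minimum distance concentrates.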
This leads to the following theorem:

Theorem 6: [The minimum distance of binary linear block codes] Let $\mathcal{C}$ be chosen uniformly at random from
the ensemble of binary linear block codes of length n and rate R. Then for every $\alpha > 0$, with probability at least
$1 - 2\exp\left(-\frac{\alpha^2}{2}\right)$, the minimum distance of $\mathcal{C}$ is in the interval

$$
\bigl[n\, h_2^{-1}(1-R) - \alpha\sqrt{n},\;\; n\, h_2^{-1}(1-R) + \alpha\sqrt{n}\bigr]
$$

and it therefore concentrates around its expected value.

Note, however, that some well-known capacity-approaching families of binary linear block codes possess a
minimum Hamming distance which grows sub-linearly with the block length n. For example, the class of parallel
concatenated convolutional (turbo) codes was proved to have a minimum distance which grows at most like the
logarithm of the interleaver length [14].

D. Concentration of the Cardinality of the Fundamental System of Cycles for LDPC Code Ensembles

Low-density parity-check (LDPC) codes are linear block codes that are represented by sparse parity-check
matrices [29]. A sparse parity-check matrix makes it possible to represent the corresponding linear block code by a sparse
bipartite graph, and to use this graphical representation for implementing low-complexity iterative message-passing
decoding. The low-complexity decoding algorithms used for LDPC codes and some of their variants are remarkable
in that they achieve rates close to the Shannon capacity limit for properly designed code ensembles (see, e.g.,
[61]). As a result of their remarkable performance under practical decoding algorithms, these coding techniques
have revolutionized the field of channel coding and they have been incorporated in various digital communication
standards during the last decade.

In the following, we consider ensembles of binary LDPC codes. The codes are represented by bipartite graphs
where the variable nodes are located on the left side of the graph, and the parity-check nodes are on the right.
The parity-check equations that define the linear code are represented by edges connecting each check node with
the variable nodes that are involved in the corresponding parity-check equation. The bipartite graphs representing
these codes are sparse in the sense that the number of edges in the graph scales linearly with the block length n of
the code. Following standard notation, let $\lambda_i$ and $\rho_i$ denote the fraction of edges attached, respectively, to variable
and parity-check nodes of degree i. The LDPC code ensemble is denoted by LDPC$(n, \lambda, \rho)$ where n is the block
length of the codes, and the pair $\lambda(x) = \sum_i \lambda_i x^{i-1}$ and $\rho(x) = \sum_i \rho_i x^{i-1}$ represents, respectively, the left and
right degree distributions of the ensemble from the edge perspective. For a short summary of preliminary material
on binary LDPC code ensembles see, e.g., [64, Section II-A].

It is well known that linear block codes which can be represented by cycle-free bipartite (Tanner) graphs have
poor performance even under ML decoding [25]. The bipartite graphs of capacity-approaching LDPC codes should
therefore have cycles. For analyzing this issue, we focused on the notion of "the cardinality of the fundamental
system of cycles of bipartite graphs". For the required preliminary material, the reader is referred to [64, Section II-E].
In [64], we address the following question:

Question: Consider an LDPC ensemble whose transmission takes place over a memoryless binary-input output-symmetric
channel, and refer to the bipartite graphs which represent codes from this ensemble where every code
is chosen uniformly at random from the ensemble. How does the average cardinality of the fundamental system of
cycles of these bipartite graphs scale as a function of the achievable gap to capacity?

In light of this question, an information-theoretic lower bound on the average cardinality of the fundamental
system of cycles was derived in [64, Corollary 1]. This bound was expressed in terms of the achievable gap to
capacity (even under ML decoding) when the communication takes place over a memoryless binary-input output-symmetric
channel. More explicitly, it was shown that if $\varepsilon$ designates the gap in rate to capacity, then the number
of fundamental cycles should grow at least like $\log\frac{1}{\varepsilon}$. Hence, this lower bound remains unbounded as the gap to
capacity tends to zero. Consistently with the study in [25] on cycle-free codes, the lower bound on the cardinality
of the fundamental system of cycles in [64, Corollary 1] shows quantitatively the necessity of cycles in bipartite
graphs which represent good LDPC code ensembles. As a continuation to this work, we present in the following
a large-deviations analysis with respect to the cardinality of the fundamental system of cycles for LDPC code
ensembles.
Let the triple $(n, \lambda, \rho)$ represent an LDPC code ensemble, and let $\mathcal{G}$ be a bipartite graph that corresponds to a
code from this ensemble. Then, the cardinality of the fundamental system of cycles of $\mathcal{G}$, denoted by $\beta(\mathcal{G})$, is equal
to

$$
\beta(\mathcal{G}) = |E(\mathcal{G})| - |V(\mathcal{G})| + c(\mathcal{G})
$$

where $E(\mathcal{G})$, $V(\mathcal{G})$ and $c(\mathcal{G})$ denote the edges, vertices and number of components of $\mathcal{G}$, respectively, and $|A|$ denotes the
number of elements of a (finite) set A. Note that for such a bipartite graph $\mathcal{G}$, there are n variable nodes and
$m = n(1 - R_d)$ parity-check nodes, so there are in total $|V(\mathcal{G})| = n(2 - R_d)$ nodes. Let $a_R$ designate the average
right degree (i.e., the average degree of the parity-check nodes), then the number of edges in $\mathcal{G}$ is given by
$|E(\mathcal{G})| = m\, a_R$. Therefore, for a code from the $(n, \lambda, \rho)$ LDPC code ensemble, the cardinality of the fundamental
system of cycles satisfies the equality

$$
\beta(\mathcal{G}) = n\bigl[(1 - R_d)\, a_R - (2 - R_d)\bigr] + c(\mathcal{G})
\tag{170}
$$

where

$$
R_d = 1 - \frac{\int_0^1 \rho(x)\, dx}{\int_0^1 \lambda(x)\, dx}, \qquad a_R = \frac{1}{\int_0^1 \rho(x)\, dx}
$$

denote, respectively, the design rate and average right degree of the ensemble.
Let

$$
E \triangleq |E(\mathcal{G})| = n(1 - R_d)\, a_R
\tag{171}
$$

denote the number of edges of an arbitrary bipartite graph $\mathcal{G}$ from the ensemble (where we refer interchangeably
to codes and to the bipartite graphs that represent these codes from the considered ensemble). Let us arbitrarily
assign numbers $1, \ldots, E$ to the E edges of $\mathcal{G}$. Based on Remarks 2 and 3, let us construct a martingale sequence
$X_0, \ldots, X_E$ where $X_i$ (for $i = 0, 1, \ldots, E$) is a RV that denotes the conditional expected number of components
of a bipartite graph $\mathcal{G}$, chosen uniformly at random from the ensemble, given that the first i edges of the graph $\mathcal{G}$
are revealed. Note that the corresponding filtration $\mathcal{F}_0 \subseteq \mathcal{F}_1 \subseteq \ldots \subseteq \mathcal{F}_E$ in this case is defined so that $\mathcal{F}_i$ is the
$\sigma$-algebra that is generated by all the sets of bipartite graphs from the considered ensemble whose first i edges are
fixed. For this martingale sequence

$$
X_0 = \mathbb{E}_{\mathrm{LDPC}(n,\lambda,\rho)}[\beta(\mathcal{G})], \qquad X_E = \beta(\mathcal{G})
$$

and (a.s.) $|X_k - X_{k-1}| \le 1$ for $k = 1, \ldots, E$ (since by revealing a new edge of $\mathcal{G}$, the number of components in
this graph can change by at most 1). By Corollary 2, it follows that for every $\alpha \ge 0$

$$
\begin{aligned}
&\mathbb{P}\bigl(|c(\mathcal{G}) - \mathbb{E}_{\mathrm{LDPC}(n,\lambda,\rho)}[c(\mathcal{G})]| \ge \alpha E\bigr) \le 2 e^{-f(\alpha) E} \\
\Rightarrow\; &\mathbb{P}\bigl(|\beta(\mathcal{G}) - \mathbb{E}_{\mathrm{LDPC}(n,\lambda,\rho)}[\beta(\mathcal{G})]| \ge \alpha E\bigr) \le 2 e^{-f(\alpha) E}
\end{aligned}
\tag{172}
$$

where the last transition follows from (170), and the function f was defined in (45). Hence, for $\alpha > 1$, this
probability is zero (since $f(\alpha) = +\infty$ for $\alpha > 1$). Note that, from (170), $\mathbb{E}_{\mathrm{LDPC}(n,\lambda,\rho)}[\beta(\mathcal{G})]$ scales linearly with n.

The combination of Eqs. (45), (171), (172) gives the following statement:

Theorem 7: [Concentration inequality for the cardinality of the fundamental system of cycles] Let LDPC$(n, \lambda, \rho)$
be the LDPC code ensemble that is characterized by a block length n, and a pair of degree distributions (from the
edge perspective) of $\lambda$ and $\rho$. Let $\mathcal{G}$ be a bipartite graph chosen uniformly at random from this ensemble. Then,
for every $\alpha \ge 0$, the cardinality of the fundamental system of cycles of $\mathcal{G}$, denoted by $\beta(\mathcal{G})$, satisfies the following
inequality:

$$
\mathbb{P}\bigl(|\beta(\mathcal{G}) - \mathbb{E}_{\mathrm{LDPC}(n,\lambda,\rho)}[\beta(\mathcal{G})]| \ge \alpha n\bigr) \le 2 \cdot 2^{-\left[1 - h_2\left(\frac{1-\eta}{2}\right)\right] n}
$$

where $h_2$ designates the binary entropy function to the base 2, $\eta \triangleq \frac{\alpha}{(1-R_d)\, a_R}$, and $R_d$ and $a_R$ designate, respectively,
the design rate and average right degree of the ensemble. Consequently, if $\eta > 1$, this probability is zero.

Remark 17: The loosened version of Theorem 7, which follows from Azuma's inequality, gets the form

$$
\mathbb{P}\bigl(|\beta(\mathcal{G}) - \mathbb{E}_{\mathrm{LDPC}(n,\lambda,\rho)}[\beta(\mathcal{G})]| \ge \alpha n\bigr) \le 2 e^{-\frac{\eta^2 n}{2}}
$$

for every $\alpha \ge 0$, and $\eta$ as defined in Theorem 7. Note, however, that the exponential decay of the two bounds is
similar for values of $\alpha$ close to zero (see the exponents in Azuma's inequality and Corollary 2 in Figure 1).
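
The quantities entering Theorem 7 can be computed from the degree distributions alone. The following sketch is illustrative: it represents $\lambda$ and $\rho$ as dictionaries mapping a node degree i to the edge fraction $\lambda_i$ or $\rho_i$ (a representation chosen for this example), evaluates $R_d$ and $a_R$, and compares the exponent-based bound of Theorem 7 with its Azuma-type loosening. Both bound expressions are written in the normalization read from the statements above, with the Azuma form taken to be $2\exp(-\eta^2 n/2)$; treat these as assumptions of the sketch.

```python
import math

def h2(x):
    """Binary entropy function to the base 2."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

def ldpc_parameters(lam, rho):
    """Design rate R_d and average right degree a_R.

    lam[i] and rho[i] are the fractions of edges attached to variable and
    parity-check nodes of degree i, so the integral of sum_i c_i x^(i-1)
    over [0, 1] equals sum_i c_i / i.
    """
    int_lam = sum(c / i for i, c in lam.items())
    int_rho = sum(c / i for i, c in rho.items())
    return 1 - int_rho / int_lam, 1 / int_rho

def cycle_concentration_bounds(n, lam, rho, alpha):
    """Return (refined bound, Azuma-type bound) on P(|beta - E[beta]| >= alpha * n)."""
    r_d, a_r = ldpc_parameters(lam, rho)
    eta = alpha / ((1 - r_d) * a_r)
    azuma = 2 * math.exp(-eta ** 2 * n / 2)
    if eta > 1:
        return 0.0, azuma          # deviations beyond the number of edges are impossible
    refined = 2 * 2 ** (-(1 - h2((1 - eta) / 2)) * n)
    return refined, azuma

# (3,6)-regular ensemble: lambda(x) = x^2, rho(x) = x^5.
r_d_36, a_r_36 = ldpc_parameters({3: 1.0}, {6: 1.0})
refined_36, azuma_36 = cycle_concentration_bounds(1000, {3: 1.0}, {6: 1.0}, 0.3)
print(r_d_36, a_r_36, refined_36, azuma_36)
```

For the (3,6)-regular ensemble this gives a design rate of one half and an average right degree of 6, so the deterministic part of (170) equals 1.5n.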
Remark 18: For various capacity-achieving sequences of LDPC code ensembles on the binary erasure channel,
the average right degree scales like $\log\frac{1}{\varepsilon}$ where $\varepsilon$ denotes the fractional gap to capacity under belief-propagation
decoding (i.e., $R_d = (1 - \varepsilon)C$) [45]. Therefore, for small values of $\alpha$, the exponential decay rate in the inequality
of Theorem 7 scales like $\left(\log\frac{1}{\varepsilon}\right)^{-2}$. This large-deviations result complements the result in [64, Corollary 1] which
provides a lower bound on the average cardinality of the fundamental system of cycles that scales like $\log\frac{1}{\varepsilon}$.

Remark 19: Consider small deviations from the expected value that scale like $\sqrt{n}$. Note that Corollary 2 is a
special case of Theorem 3 when $\gamma = 1$ (i.e., when only an upper bound on the jumps of the martingale sequence is
available, but there is no non-trivial upper bound on the conditional variance). Hence, it follows from Proposition 4 
that Corollary 2 does not provide in this case any improvement in the exponent of the concentration inequality (as 
compared to Azuma's inequality) when small deviations are considered. 

E. Performance of LDPC Codes under Iterative Message-Passing Decoding 

In the following, we consider ensembles of binary LDPC codes. Following standard notation, let $\lambda_i$ and $\rho_i$
denote the fraction of edges attached, respectively, to variable and parity-check nodes of degree $i$. The LDPC
code ensemble that is denoted by $\mathrm{LDPC}(n, \lambda, \rho)$ is characterized by the block length $n$ of the codes, and the pair
$\lambda(x) \triangleq \sum_i \lambda_i x^{i-1}$ and $\rho(x) \triangleq \sum_i \rho_i x^{i-1}$ which represent, respectively, the left and right degree distributions from
the edge perspective.

The following theorem was proved in [61, Appendix C] based on Azuma's inequality: 

Theorem 8: [Concentration of the bit error probability around the ensemble average] Let $\mathcal{C}$, a code chosen
uniformly at random from the ensemble $\mathrm{LDPC}(n, \lambda, \rho)$, be used for transmission over a memoryless binary-input
output-symmetric (MBIOS) channel characterized by its L-density $a_{\mathrm{MBIOS}}$. Assume that the decoder performs $l$
iterations of message-passing decoding, and let $P_b(\mathcal{C}, a_{\mathrm{MBIOS}}, l)$ denote the resulting bit error probability. Then, for
every $\delta > 0$, there exists an $\alpha > 0$ where $\alpha = \alpha(\lambda, \rho, \delta, l)$ (independent of the block length $n$) such that

\[
\mathbb{P}\bigl(|P_b(\mathcal{C}, a_{\mathrm{MBIOS}}, l) - \mathbb{E}_{\mathrm{LDPC}(n,\lambda,\rho)}[P_b(\mathcal{C}, a_{\mathrm{MBIOS}}, l)]| \ge \delta\bigr) \le \exp(-\alpha n).
\]

This theorem asserts that all except an exponentially (in the block length) small fraction of codes behave within an
arbitrarily small $\delta$ from the ensemble average (where $\delta$ is a positive number that can be chosen arbitrarily small).
Therefore, assuming a sufficiently large block length, the ensemble average is a good indicator for the performance 
of individual codes, and it is therefore reasonable to focus on the design and analysis of capacity-approaching 
ensembles (via the density evolution technique). This theorem is proved in [61, pp. 487-490] based on Azuma's 
inequality. 

F. On the Conditional Entropy for LDPC Code Ensembles 

A large-deviation analysis of the conditional entropy for random ensembles of LDPC codes was introduced in 
[52, Theorem 4] and [54, Theorem 1]. The following theorem is proved in [52, Appendix I] based on Azuma's 
inequality: 

Theorem 9: [Large deviations of the conditional entropy] Let $\mathcal{C}$ be chosen uniformly at random from the
ensemble $\mathrm{LDPC}(n, \lambda, \rho)$. Assume that the transmission of the code $\mathcal{C}$ takes place over an MBIOS channel. Let
$H(\mathbf{X}|\mathbf{Y})$ designate the conditional entropy of the transmitted codeword $\mathbf{X}$ given the received sequence $\mathbf{Y}$ from
the channel. Then for any $\xi > 0$,

\[
\mathbb{P}\bigl(|H(\mathbf{X}|\mathbf{Y}) - \mathbb{E}_{\mathrm{LDPC}(n,\lambda,\rho)}[H(\mathbf{X}|\mathbf{Y})]| \ge n\xi\bigr) \le 2\exp(-nB\xi^2)
\]

where $B \triangleq \frac{1}{2(d_c^{\max}+1)^2 (1-R_d)}$, $d_c^{\max}$ is the maximal check-node degree, and $R_d$ is the design rate of the ensemble.
The conditional entropy scales linearly with $n$, and this inequality considers deviations from the average which also
scale linearly with $n$.

In the following, we revisit the proof of Theorem 9 in [52, Appendix I] in order to derive a tightened version 
of this bound. Based on this proof, let $\mathcal{G}$ be a bipartite graph which represents a code chosen uniformly at random
from the ensemble $\mathrm{LDPC}(n, \lambda, \rho)$. Define the RV

\[
Z = H_{\mathcal{G}}(\mathbf{X}|\mathbf{Y})
\]

which forms the conditional entropy when the transmission takes place over an MBIOS channel whose transition
probability is given by $P_{\mathbf{Y}|\mathbf{X}}(\mathbf{y}|\mathbf{x}) = \prod_{i=1}^{n} p_{Y|X}(y_i|x_i)$ where $p_{Y|X}(y|1) = p_{Y|X}(-y|0)$. Fix an arbitrary order
for the $m = n(1 - R_d)$ parity-check nodes where $R_d$ forms the design rate of the LDPC code ensemble. Let
$\{\mathcal{F}_t\}_{t \in \{0,1,\ldots,m\}}$ form a filtration of $\sigma$-algebras $\mathcal{F}_0 \subseteq \mathcal{F}_1 \subseteq \ldots \subseteq \mathcal{F}_m$ where $\mathcal{F}_t$ (for $t = 0, 1, \ldots, m$) is the
$\sigma$-algebra that is generated by all the sub-sets of $m \times n$ parity-check matrices that are characterized by the pair
of degree distributions $(\lambda, \rho)$ and whose first $t$ parity-check equations are fixed (for $t = 0$ nothing is fixed, and
therefore $\mathcal{F}_0 = \{\emptyset, \Omega\}$ where $\emptyset$ denotes the empty set, and $\Omega$ is the whole sample space of $m \times n$ binary parity-check
matrices that are characterized by the pair of degree distributions $(\lambda, \rho)$). Accordingly, based on Remarks 2 and 3,
let us define the following martingale sequence

\[
Z_t = \mathbb{E}[Z \mid \mathcal{F}_t], \qquad t \in \{0, 1, \ldots, m\}.
\]

By construction, $Z_0 = \mathbb{E}[H_{\mathcal{G}}(\mathbf{X}|\mathbf{Y})]$ is the expected value of the conditional entropy for the LDPC code ensemble,
and $Z_m$ is the RV that is equal (a.s.) to the conditional entropy of the particular code from the ensemble (see
Remark 3). Similarly to [52, Appendix I], we obtain upper bounds on the differences $|Z_{t+1} - Z_t|$ and then rely on
Azuma's inequality in Theorem 1.

Without loss of generality, the parity-checks are ordered in [52, Appendix I] by increasing degree. Let $\mathbf{r} =
(r_1, r_2, \ldots)$ be the set of parity-check degrees in ascending order, and $\Gamma_i$ be the fraction of parity-check nodes of
degree $i$. Hence, the first $m_1 = n(1 - R_d)\Gamma_{r_1}$ parity-check nodes are of degree $r_1$, the successive $m_2 = n(1 - R_d)\Gamma_{r_2}$
parity-check nodes are of degree $r_2$, and so on. The $(t + 1)$th parity-check will therefore have a well defined degree,
to be denoted by $r$. From the proof in [52, Appendix I]

\[
|Z_{t+1} - Z_t| \le (r + 1)\, H_{\mathcal{G}}(\tilde{X}|\mathbf{Y}) \tag{173}
\]

where $H_{\mathcal{G}}(\tilde{X}|\mathbf{Y})$ is a RV which designates the conditional entropy of a parity-bit $\tilde{X} = X_{i_1} \oplus \ldots \oplus X_{i_r}$ (i.e., $\tilde{X}$
is equal to the modulo-2 sum of some $r$ bits in the codeword $\mathbf{X}$) given the received sequence $\mathbf{Y}$ at the channel
output. The proof in [52, Appendix I] was then completed by upper bounding the parity-check degree $r$ by the
maximal parity-check degree $d_c^{\max}$, and also by upper bounding the conditional entropy of the parity-bit $\tilde{X}$ by 1.
This gives

\[
|Z_{t+1} - Z_t| \le d_c^{\max} + 1, \qquad t = 0, 1, \ldots, m - 1 \tag{174}
\]

which then proves Theorem 9 from Azuma's inequality. Note that the $d_k$'s in Theorem 1 are equal to $d_c^{\max} + 1$,
and $n$ in Theorem 1 is replaced with the length $m = n(1 - R_d)$ of the martingale sequence $\{Z_t\}$ (that is equal to
the number of the parity-check nodes in the graph).
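
The step from (174) to the constant $B$ of Theorem 9 can be verified numerically. The sketch below assumes the two-sided form of Azuma's inequality, $2\exp\bigl(-r^2 / (2\sum_k d_k^2)\bigr)$, with $m = n(1-R_d)$ equal increments $d_k = d_c^{\max}+1$ and deviation $r = n\xi$; the function names and the sample values of $n$ and $\xi$ are illustrative assumptions.

```python
import math
from typing import Sequence


def azuma_two_sided(deviation: float, d: Sequence[float]) -> float:
    """Two-sided Azuma bound 2*exp(-deviation^2 / (2 * sum_k d_k^2))."""
    return 2.0 * math.exp(-deviation ** 2 / (2.0 * sum(x * x for x in d)))


def theorem9_B(dc_max: int, design_rate: float) -> float:
    """Constant B of Theorem 9: 1 / (2 (dc_max + 1)^2 (1 - R_d))."""
    return 1.0 / (2.0 * (dc_max + 1) ** 2 * (1.0 - design_rate))


if __name__ == "__main__":
    n, design_rate, dc_max, xi = 10_000, 0.9, 20, 0.5  # sample values only
    m = round(n * (1.0 - design_rate))                  # number of parity checks
    direct = azuma_two_sided(n * xi, [dc_max + 1.0] * m)
    closed_form = 2.0 * math.exp(-n * theorem9_B(dc_max, design_rate) * xi ** 2)
    print(direct, closed_form, theorem9_B(dc_max, design_rate))
```

Both evaluations agree up to floating-point rounding, which confirms that substituting $\sum_k d_k^2 = n(1-R_d)(d_c^{\max}+1)^2$ reproduces the exponent $nB\xi^2$.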

In the continuation, we deviate from the proof in [52, Appendix I] in two respects: 

• The first difference is related to the upper bound on the conditional entropy $H_{\mathcal{G}}(\tilde{X}|\mathbf{Y})$ in (173) where $\tilde{X}$ is
the modulo-2 sum of some $r$ bits of the transmitted codeword $\mathbf{X}$ given the channel output $\mathbf{Y}$. Instead of taking
the most trivial upper bound that is equal to 1, as was done in [52, Appendix I], a simple upper bound on the
conditional entropy is derived; this bound depends on the parity-check degree $r$ and the channel capacity $C$
(see Proposition 6).

• The second difference is minor, but it proves to be helpful for tightening the large-deviation inequality for
LDPC code ensembles that are not right-regular (i.e., the case where the degrees of the parity-check nodes
are not fixed to a certain value). Instead of upper bounding the term $r + 1$ on the right-hand side of (173)
with $d_c^{\max} + 1$, it is suggested to leave it as is since Azuma's inequality applies to the case where the bounded
differences of the martingale sequence are not fixed (see Theorem 1), and since the number of the parity-check
nodes of degree $r$ is equal to $n(1 - R_d)\Gamma_r$. The effect of this simple modification will be shown in Example 10.

The following upper bound is related to the first item above: 

Proposition 6: Let $\mathcal{G}$ be a bipartite graph which corresponds to a binary linear block code whose transmission
takes place over an MBIOS channel. Let $\mathbf{X}$ and $\mathbf{Y}$ designate the transmitted codeword and received sequence at the
channel output. Let $\tilde{X} = X_{i_1} \oplus \ldots \oplus X_{i_r}$ be a parity-bit of some $r$ code bits of $\mathbf{X}$. Then, the conditional entropy
of $\tilde{X}$ given $\mathbf{Y}$ satisfies

\[
H_{\mathcal{G}}(\tilde{X}|\mathbf{Y}) \le h_2\left(\frac{1 - C^{\frac{r}{2}}}{2}\right). \tag{175}
\]

Further, for a binary symmetric channel (BSC) or a binary erasure channel (BEC), this bound can be improved to

\[
h_2\left(\frac{1 - \bigl[1 - 2h_2^{-1}(1 - C)\bigr]^r}{2}\right) \tag{176}
\]

and

\[
1 - C^r \tag{177}
\]

respectively, where $h_2^{-1}$ in (176) designates the inverse of the binary entropy function to the base 2.

Note that if the MBIOS channel is perfect (i.e., its capacity is $C = 1$ bit per channel use) then (175) holds with
equality (where both sides of (175) are zero), whereas the trivial upper bound is 1.

Proof: Let us upper bound the conditional entropy $H(\tilde{X}|\mathbf{Y})$ with $H(\tilde{X}|Y_{i_1}, \ldots, Y_{i_r})$, where the latter
conditioning refers to the intrinsic information for the bits $X_{i_1}, \ldots, X_{i_r}$ which are used to calculate the parity-bit $\tilde{X}$. Then,
from [64, Eq. (17) and Appendix I], the conditional entropy of the bit $\tilde{X}$ given the $n$-length received sequence $\mathbf{Y}$
satisfies the inequality

\[
H(\tilde{X}|\mathbf{Y}) \le 1 - \frac{1}{2\ln 2} \sum_{p=1}^{\infty} \frac{(g_p)^r}{p(2p-1)} \tag{178}
\]

where (see [64, Eq. (19)])

\[
g_p \triangleq \int_0^{\infty} a(l)\,(1 + e^{-l}) \tanh^{2p}\left(\frac{l}{2}\right) dl, \qquad \forall p \in \mathbb{N} \tag{179}
\]

and $a(\cdot)$ denotes the symmetric pdf of the log-likelihood ratio at the output of the MBIOS channel, given that the
channel input is equal to zero. From [64, Lemmas 4 and 5], it follows that

\[
g_p \ge C^p, \qquad \forall p \in \mathbb{N}.
\]

Substituting this inequality in (178) gives that

\[
H(\tilde{X}|\mathbf{Y}) \le 1 - \frac{1}{2\ln 2} \sum_{p=1}^{\infty} \frac{C^{pr}}{p(2p-1)} = h_2\left(\frac{1 - C^{\frac{r}{2}}}{2}\right) \tag{180}
\]

where the last equality follows from the power series expansion of the binary entropy function:

\[
h_2(x) = 1 - \frac{1}{2\ln 2} \sum_{p=1}^{\infty} \frac{(1 - 2x)^{2p}}{p(2p-1)}, \qquad 0 \le x \le 1. \tag{181}
\]

The tightened bound on the conditional entropy for the BSC is obtained from (178) and the equality

\[
g_p = \bigl(1 - 2h_2^{-1}(1 - C)\bigr)^{2p}, \qquad \forall p \in \mathbb{N}
\]

which holds for the BSC (see [64, Eq. (97)]). This replaces $C$ on the right-hand side of (180) with $\bigl(1 - 2h_2^{-1}(1 - C)\bigr)^2$,
thus leading to the tightened bound in (176).

The tightened result for the BEC holds since from (179)

\[
g_p = C, \qquad \forall p \in \mathbb{N}
\]

(see [64, Appendix II]), and a substitution of this equality in (178) gives (177) (note that $\sum_{p=1}^{\infty} \frac{1}{p(2p-1)} = 2\ln 2$).
This completes the proof of Proposition 6. ■
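
The power series (181) and the bounds (175) and (177) can be sanity-checked numerically. The following is a minimal sketch with illustrative function names; the series is truncated after a fixed number of terms, which is an assumption that is adequate away from the endpoints $x = 0$ and $x = 1$ where convergence is slow. The BSC bound (176) is omitted since it needs a numerical inverse of $h_2$.

```python
import math


def h2(x: float) -> float:
    """Binary entropy function to the base 2, with h2(0) = h2(1) = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log2(x) - (1.0 - x) * math.log2(1.0 - x)


def h2_series(x: float, terms: int = 200) -> float:
    """Truncated power series (181) for the binary entropy function."""
    total = sum((1.0 - 2.0 * x) ** (2 * p) / (p * (2 * p - 1))
                for p in range(1, terms + 1))
    return 1.0 - total / (2.0 * math.log(2.0))


def parity_entropy_bound(capacity: float, r: int) -> float:
    """Bound (175) for a general MBIOS channel of the given capacity."""
    return h2((1.0 - capacity ** (r / 2.0)) / 2.0)


def parity_entropy_bound_bec(capacity: float, r: int) -> float:
    """Tightened bound (177) for the BEC."""
    return 1.0 - capacity ** r


if __name__ == "__main__":
    for x in (0.1, 0.3, 0.5):
        print(x, h2(x), h2_series(x))
    print(parity_entropy_bound(0.98, 20), parity_entropy_bound_bec(0.98, 20))
```

The BEC bound never exceeds the general one, in line with $h_2(x) \ge 4x(1-x)$, and both vanish for a perfect channel.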
From Proposition 6 and (173)

\[
|Z_{t+1} - Z_t| \le (r + 1)\, h_2\left(\frac{1 - C^{\frac{r}{2}}}{2}\right) \tag{182}
\]

with the corresponding two improvements for the BSC and BEC (where the second term on the right-hand side of
(182) is replaced by (176) and (177), respectively). This improves the loosened bound $(d_c^{\max} + 1)$ in [52, Appendix I].
From (182) and Theorem 1, we obtain the following tightened version of the large-deviation inequality in
Theorem 9.

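Theorem 10: [A first tightened large-deviation inequality for the conditional entropy] Let $\mathcal{C}$ be chosen
uniformly at random from the ensemble $\mathrm{LDPC}(n, \lambda, \rho)$. Assume that the transmission of the code $\mathcal{C}$ takes place
over an MBIOS channel. Let $H(\mathbf{X}|\mathbf{Y})$ designate the conditional entropy of the transmitted codeword $\mathbf{X}$ given the
received sequence $\mathbf{Y}$ at the channel output. Then

\[
\mathbb{P}\bigl(|H(\mathbf{X}|\mathbf{Y}) - \mathbb{E}_{\mathrm{LDPC}(n,\lambda,\rho)}[H(\mathbf{X}|\mathbf{Y})]| \ge n\xi\bigr) \le 2\exp(-nB\xi^2)
\]

for every $\xi > 0$, and

\[
B \triangleq \frac{1}{2(1 - R_d) \sum_{i=1}^{d_c^{\max}} (i+1)^2\, \Gamma_i \left[h_2\left(\frac{1 - C^{\frac{i}{2}}}{2}\right)\right]^2} \tag{183}
\]

where $d_c^{\max}$ is the maximal check-node degree, $R_d$ is the design rate of the ensemble, and $C$ is the channel capacity
(in bits per channel use).

For the BSC and BEC, the parameter $B$ can be improved (increased) to

\[
B \triangleq \frac{1}{2(1 - R_d) \sum_{i=1}^{d_c^{\max}} (i+1)^2\, \Gamma_i \left[h_2\left(\frac{1 - [1 - 2h_2^{-1}(1 - C)]^i}{2}\right)\right]^2}
\]

and

\[
B \triangleq \frac{1}{2(1 - R_d) \sum_{i=1}^{d_c^{\max}} (i+1)^2\, \Gamma_i\, (1 - C^i)^2} \tag{184}
\]

respectively.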

Remark 20: From (183), Theorem 10 indeed yields a stronger large-deviation inequality than Theorem 9. 

Remark 21: In the limit where $C \to 1$ bit per channel use, it follows from (183) that if $d_c^{\max} < \infty$ then $B \to \infty$.
This is in contrast to the value of $B$ in Theorem 9 which does not depend on the channel capacity and is finite.
Note that $B$ should indeed be infinity for a perfect channel, and therefore Theorem 10 is tight in this case.

In the case where $d_c^{\max}$ is not finite, we prove the following:

Lemma 5: If $d_c^{\max} = \infty$ and $\rho'(1) < \infty$ then $B \to \infty$ in the limit where $C \to 1$.

Proof: See Appendix I. ■
This is in contrast to the value of $B$ in Theorem 9 which vanishes when $d_c^{\max} = \infty$, and therefore Theorem 9 is
not informative in this case (see Example 10).

Example 9: [Comparison of Theorems 9 and 10 for right-regular LDPC code ensembles] In the following, we
exemplify the improvement in the tightness of Theorem 10 for right-regular LDPC code ensembles. Consider the
case where the communication takes place over a binary-input additive white Gaussian noise channel (BIAWGNC)
or a BEC. Let us consider the (2, 20) regular LDPC code ensemble whose design rate is equal to 0.900 bits per
channel use. For a BEC, the threshold of the channel bit erasure probability under belief-propagation (BP) decoding
is given by

\[
p_{\mathrm{BP}} = \inf_{x \in (0,1]} \frac{x}{1 - (1 - x)^{19}} = 0.0531
\]

which corresponds to a channel capacity of $C = 0.9469$ bits per channel use. For the BIAWGNC, the threshold
under BP decoding is equal to $\sigma_{\mathrm{BP}} = 0.4156590$. From [61, Example 4.38] which expresses the capacity of the
BIAWGNC in terms of the standard deviation $\sigma$ of the Gaussian noise, the minimum capacity of a BIAWGNC
over which it is possible to communicate with vanishing bit error probability under BP decoding is $C = 0.9685$
bits per channel use. Accordingly, let us assume that for reliable communications on both channels, the capacity
of the BEC and BIAWGNC is set to 0.98 bits per channel use.

Since the considered code ensemble is right-regular (i.e., the parity-check degree is fixed to $d_c = 20$), $B$
in Theorem 10 is improved by a factor of

\[
\frac{1}{\left[h_2\left(\frac{1 - C^{\frac{d_c}{2}}}{2}\right)\right]^2} = 5.134.
\]

This implies that the inequality in Theorem 10 is satisfied with a block length that is 5.134 times shorter than the
block length which corresponds to Theorem 9. For the BEC, the result is improved by a factor of

\[
\frac{1}{\bigl(1 - C^{d_c}\bigr)^2} = 9.051
\]

due to the tightened value of $B$ in (184) as compared to Theorem 9.
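
The two improvement factors quoted in Example 9 can be reproduced from the expressions for $B$. The sketch below assumes the forms $B = 1/[2(d_c^{\max}+1)^2(1-R_d)]$ for Theorem 9 and (183), (184) for Theorem 10; the helper names, and the representation of the node-perspective fractions $\Gamma_i$ as a dictionary, are illustrative choices.

```python
import math
from typing import Dict


def h2(x: float) -> float:
    """Binary entropy function to the base 2, with h2(0) = h2(1) = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log2(x) - (1.0 - x) * math.log2(1.0 - x)


def theorem9_B(dc_max: int, design_rate: float) -> float:
    """Constant B of Theorem 9."""
    return 1.0 / (2.0 * (dc_max + 1) ** 2 * (1.0 - design_rate))


def theorem10_B(gamma: Dict[int, float], design_rate: float,
                capacity: float, bec: bool = False) -> float:
    """Constant B of Theorem 10: (183) in general, (184) when bec=True.

    gamma maps a parity-check degree i to the fraction of check nodes
    of that degree (node perspective).
    """
    total = 0.0
    for degree, fraction in gamma.items():
        if bec:
            entropy_bound = 1.0 - capacity ** degree
        else:
            entropy_bound = h2((1.0 - capacity ** (degree / 2.0)) / 2.0)
        total += (degree + 1) ** 2 * fraction * entropy_bound ** 2
    if total == 0.0:
        return math.inf  # perfect channel: no fluctuation at all
    return 1.0 / (2.0 * (1.0 - design_rate) * total)


if __name__ == "__main__":
    base = theorem9_B(20, 0.9)
    print(theorem10_B({20: 1.0}, 0.9, 0.98) / base)            # MBIOS factor
    print(theorem10_B({20: 1.0}, 0.9, 0.98, bec=True) / base)  # BEC factor
```

For a right-regular ensemble the ratio depends only on $d_c$ and $C$; small differences in the last digit of derived exponents can come from rounding $B$ of Theorem 9 before multiplying.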

Example 10: [Comparison of Theorems 9 and 10 for a heavy-tail Poisson distribution (Tornado codes)] In the 
following, we compare Theorems 9 and 10 for Tornado LDPC code ensembles. This capacity-achieving sequence 
for the BEC refers to the heavy-tail Poisson distribution, and it was introduced in [45, Section IV], [71] (see also 
[61, Problem 3.20]). We rely in the following on the analysis in [64, Appendix VI]. 

Suppose that we wish to design Tornado code ensembles that achieve a fraction $1 - \varepsilon$ of the capacity of a BEC
under iterative message-passing decoding (where $\varepsilon$ can be set arbitrarily small). Let $p$ designate the bit erasure
probability of the channel. The parity-check degree is Poisson distributed, and therefore the maximal degree of the
parity-check nodes is infinity. Hence, $B = 0$ according to Theorem 9, and this theorem is therefore useless for the
considered code ensemble. On the other hand, from Theorem 10

\[
\begin{aligned}
\sum_i (i+1)^2\, \Gamma_i \left[h_2\left(\frac{1 - C^{\frac{i}{2}}}{2}\right)\right]^2
&\overset{(a)}{\le} \sum_i (i+1)^2\, \Gamma_i \\
&\overset{(b)}{=} \frac{\sum_i \rho_i (i+2)}{\int_0^1 \rho(x)\, dx} + 1 \\
&\overset{(c)}{=} \bigl(\rho'(1) + 3\bigr)\, d_c^{\mathrm{avg}} + 1 \\
&\overset{(d)}{=} \left(\frac{\lambda'(0)\rho'(1)}{\lambda_2} + 3\right) d_c^{\mathrm{avg}} + 1 \\
&\overset{(e)}{\le} \left(\frac{1}{p\lambda_2} + 3\right) d_c^{\mathrm{avg}} + 1 \\
&\overset{(f)}{=} O\left(\log^2\frac{1}{\varepsilon}\right)
\end{aligned}
\]

where inequality (a) holds since the binary entropy function to the base 2 is bounded between zero and one, equality (b)
holds since

\[
\Gamma_i = \frac{\rho_i / i}{\int_0^1 \rho(x)\, dx}
\]

where $\Gamma_i$ and $\rho_i$ denote the fraction of parity-check nodes and the fraction of edges that are connected to parity-check
nodes of degree $i$ respectively (and also since $\sum_i \Gamma_i = 1$), equality (c) holds since

\[
d_c^{\mathrm{avg}} = \frac{1}{\int_0^1 \rho(x)\, dx}
\]

where $d_c^{\mathrm{avg}}$ denotes the average parity-check node degree, equality (d) holds since $\lambda'(0) = \lambda_2$, inequality (e) is due
to the stability condition for the BEC (where $p\lambda'(0)\rho'(1) < 1$ is a necessary condition for reliable communication
on the BEC under BP decoding), and finally equality (f) follows from the analysis in [64, Appendix VI] (an upper
bound on $\lambda_2$ is derived in [64, Eq. (120)], and the average parity-check node degree scales like $\log\frac{1}{\varepsilon}$). Hence,
from the above chain of inequalities and (183), it follows that for a small gap to capacity, the parameter $B$ in
Theorem 10 scales (at least) like

\[
O\left(\frac{1}{\log^2\frac{1}{\varepsilon}}\right).
\]

Theorem 10 is therefore useful for the large-deviation analysis of this LDPC code ensemble. As shown above,
the parameter $B$ in (183) tends to zero rather slowly as we let the fractional gap $\varepsilon$ tend to zero (which therefore
demonstrates a rather fast concentration in Theorem 10).

Example 11: This example forms a direct continuation of Example 9 for the $(n, d_v, d_c)$ regular LDPC code
ensembles where $d_v = 2$ and $d_c = 20$. With the settings in this example, Theorem 9 gives that

\[
\mathbb{P}\bigl(|H(\mathbf{X}|\mathbf{Y}) - \mathbb{E}_{\mathrm{LDPC}(n,\lambda,\rho)}[H(\mathbf{X}|\mathbf{Y})]| \ge n\xi\bigr) \le 2\exp(-0.0113\, n\xi^2) \tag{185}
\]

for every $\xi > 0$. As was mentioned already in Example 9, the exponential inequalities in Theorem 10 achieve an
improvement in the exponent of Theorem 9 by factors 5.134 and 9.051 for the BIAWGNC and BEC, respectively.
One therefore obtains via the inequalities in Theorem 10 that for every $\xi > 0$

\[
\mathbb{P}\bigl(|H(\mathbf{X}|\mathbf{Y}) - \mathbb{E}_{\mathrm{LDPC}(n,\lambda,\rho)}[H(\mathbf{X}|\mathbf{Y})]| \ge n\xi\bigr) \le
\begin{cases}
2\exp(-0.0580\, n\xi^2), & \text{BIAWGNC} \\
2\exp(-0.1023\, n\xi^2), & \text{BEC}
\end{cases} \tag{186}
\]

G. Concentration Theorems for LDPC Code Ensembles over ISI channels 

Concentration analysis on the number of erroneous variable-to-check messages for random ensembles of LDPC 
codes was introduced in [46] and [60] for memoryless channels. It was shown that the performance of an individual 
code from the ensemble concentrates around the expected (average) value over this ensemble as the
block length of the code grows, and that this average behavior converges to the behavior of the cycle-free case.
These results were later generalized in [36] for the case of channels with memory (i.e., for ISI channels). In this 
section, we revisit the proofs of [36, Theorems 1 and 2] for the case of regular LDPC code ensembles in order 
to derive an explicit expression for the exponential rate that is related to the concentration inequality. It is then 
shown that particularizing the expression for memoryless channels provides a tightened concentration inequality as 
compared to [46] and [60]. 

1) The ISI Channel and its message-passing decoding: In the following, we briefly describe the ISI channel and
the graph used for its message-passing decoding. For a detailed description, the reader is referred to [36]. Consider
a binary discrete-time ISI channel with a finite memory length, to be denoted by $I$. The channel's output $Y_t$ at
time $t \in \mathbb{Z}$ is given by the equality

\[
Y_t = \sum_{i=0}^{I} h_i X_{t-i} + N_t
\]

where $X_t \in \{+1, -1\}$ is the channel's input, $\{h_i\}$ is the channel's response and $N_t \sim N(0, \sigma^2)$ is an i.i.d. AWGN
noise sequence. The information block of length $k$ is coded using a regular $(n, d_v, d_c)$ LDPC code, and the resulting
$n$ coded bits are converted to $X_t \in \{+1, -1\}$ before transmission over the channel. For decoding we consider the
windowed version of the "sum-product" algorithm when applied to ISI channels (see details in [36] and [22]). As in
the memoryless case, this is a message-passing algorithm. The variable-to-check and check-to-variable messages are
computed as in the min-sum algorithm for the memoryless case with the difference that a variable node's message
from the channel is not only a function of the channel output that corresponds to the considered symbol but
also a function of $2W$ neighboring channel outputs and $2W$ neighboring variable nodes as illustrated in Fig. 3.
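
The channel model above can be illustrated with a short simulation. This is a minimal sketch under stated assumptions that are not part of the text: the bit-to-symbol mapping is taken as $0 \mapsto +1$, $1 \mapsto -1$, channel inputs before time 0 are treated as absent, and the function name and tap values are hypothetical.

```python
import random
from typing import List, Sequence


def isi_channel_output(bits: Sequence[int], taps: Sequence[float],
                       sigma: float, rng: random.Random) -> List[float]:
    """Return Y_t = sum_{i=0}^{I} h_i X_{t-i} + N_t for t = 0, ..., n-1.

    taps holds (h_0, ..., h_I), so the memory length is I = len(taps) - 1.
    Assumptions: bit 0 maps to +1 and bit 1 to -1; inputs before t = 0
    contribute nothing.
    """
    x = [1.0 - 2.0 * b for b in bits]
    outputs = []
    for t in range(len(x)):
        noiseless = sum(h * x[t - i] for i, h in enumerate(taps) if t - i >= 0)
        outputs.append(noiseless + rng.gauss(0.0, sigma))
    return outputs


if __name__ == "__main__":
    rng = random.Random(1)
    print(isi_channel_output([0, 1, 1, 0, 1], [1.0, 0.5], 0.3, rng))
```

With memory length $I$, each input symbol influences several consecutive outputs, which is why a variable node's channel message depends on a window of neighboring outputs.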

2) Concentration: It is proved in this sub-section that for a large $n$, a neighborhood of depth $\ell$ of a variable-to-check
node message is tree-like with high probability. Using Azuma's inequality and the latter result, it is shown that
for most graphs and channel realizations, if $\mathbf{s}$ is the transmitted codeword, then the probability of a variable-to-check
message being erroneous after $\ell$ rounds of message-passing decoding is highly concentrated around its expected
value. This expected value is shown to converge to the value of $p^{(\ell)}(\mathbf{s})$ which corresponds to the cycle-free case.
Also, we prove that if the transmitted sequence is i.u.d., then the probability is highly concentrated around the value
$p^{(\ell)}_{\mathrm{i.u.d.}}$ (see Theorem 13).

Fig. 3. Message flow neighborhood of depth 1. In this figure $(I, W, d_v = L, d_c = R) = (1, 1, 2, 3)$.

In the following theorems, we consider an ISI channel and windowed message-passing decoding algorithm, when
the code graph is chosen uniformly at random from the ensemble of the graphs with variable and check node degree
$d_v$ and $d_c$ respectively. Denote $\mathcal{N}_e^{(\ell)}$ as the neighborhood of depth $\ell$ of an edge $e = (v, c)$ between a variable node and a
check node. Let $N_c^{(\ell)}$, $N_v^{(\ell)}$ and $N_e^{(\ell)}$ be the total number of check nodes, variable nodes and code related edges
respectively in this neighborhood. Similarly denote $N_Y^{(\ell)}$ as the number of variable-to-check node messages in the
directed neighborhood of depth $\ell$ of a received value of the channel.

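Theorem 11: [Probability of a neighborhood of depth $\ell$ of a variable-to-check node message to be tree-like
for channels with ISI] Define $P_{\bar{t}}^{(\ell)} \triangleq \Pr\{\mathcal{N}_e^{(\ell)} \text{ not a tree}\}$ as the probability that $\mathcal{N}_e^{(\ell)}$ is not tree-like. Then, there
exists a positive constant $\gamma(d_v, d_c, \ell) = \bigl(N_v^{(\ell)}\bigr)^2 + \frac{d_c}{d_v}\bigl(N_c^{(\ell)}\bigr)^2$ such that $P_{\bar{t}}^{(\ell)} \le \frac{\gamma}{n}$.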

Proof: This proof follows from the proof in [60] and extends it to the case of ISI channels. Consider a
neighborhood $\mathcal{N}_e^{(\ell)}$ of fixed depth $\ell$. Note that at each level the graph expands by a factor $\alpha = (d_v - 1 + 2Wd_v)(d_c - 1)$,
therefore there are in total

\[
N_v^{(\ell)} = 1 + \bigl[(d_v - 1)(d_c - 1) + 2W\bigl(1 + d_v(d_c - 1)\bigr)\bigr] \sum_{i=0}^{\ell-1} \alpha^i
\]

variable nodes and

\[
N_c^{(\ell)} = 1 + (d_v - 1 + 2Wd_v) \sum_{i=0}^{\ell-1} \alpha^i
\]

check nodes in this neighborhood.
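
The two node counts can be evaluated directly from these expressions. The following is a minimal sketch with illustrative function names; it takes the formulas for $\alpha$, $N_v^{(\ell)}$ and $N_c^{(\ell)}$ exactly as stated above and uses integer arithmetic throughout.

```python
from typing import Tuple


def growth_factor(dv: int, dc: int, window: int) -> int:
    """Per-level expansion factor alpha = (dv - 1 + 2*W*dv) * (dc - 1)."""
    return (dv - 1 + 2 * window * dv) * (dc - 1)


def neighborhood_sizes(dv: int, dc: int, window: int, depth: int) -> Tuple[int, int]:
    """Return (N_v, N_c): variable and check node counts at the given depth."""
    alpha = growth_factor(dv, dc, window)
    geometric_sum = sum(alpha ** i for i in range(depth))
    n_var = 1 + ((dv - 1) * (dc - 1)
                 + 2 * window * (1 + dv * (dc - 1))) * geometric_sum
    n_chk = 1 + (dv - 1 + 2 * window * dv) * geometric_sum
    return n_var, n_chk


if __name__ == "__main__":
    print(neighborhood_sizes(3, 4, 0, 2))  # memoryless case, W = 0
    print(neighborhood_sizes(2, 3, 1, 1))  # parameters of Fig. 3
```

Setting the window parameter to zero recovers the usual tree counts of the memoryless case, which gives a quick consistency check on the expressions.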

In order to upper bound $P_{\bar{t}}^{(\ell)}$, we lower bound $P_t^{(\ell)} \triangleq 1 - P_{\bar{t}}^{(\ell)}$. This is done by factorizing $P_t^{(\ell)}$ as

\[
P_t^{(\ell)} = \Pr\{\mathcal{N}_e^{(0)} \text{ is tree}\} \prod_{\ell^*=0}^{\ell-1} \Pr\{\mathcal{N}_e^{(\ell^*+1)} \text{ is tree} \mid \mathcal{N}_e^{(\ell^*)} \text{ is tree}\} \tag{187}
\]

and bounding each factor. For $\ell^* = 0$ we have a single edge which is a tree, therefore $\Pr\{\mathcal{N}_e^{(0)} \text{ is tree}\} = 1$. To
bound $\Pr\{\mathcal{N}_e^{(\ell^*+1)} \text{ is tree} \mid \mathcal{N}_e^{(\ell^*)} \text{ is tree}\}$ we assume that $\mathcal{N}_e^{(\ell^*)}$ is tree-like and reveal the code related edges (variable-
to-check node or vice versa, as opposed to the channel related edges which are predetermined) one at a time. If in
this process (of revealing the $(\ell^* + 1)$-th level of the tree) no loops are created then $\mathcal{N}_e^{(\ell^*+1)}$ is also a tree. We start
by revealing the leaves of a variable node. As opposed to the case with no ISI, where each variable node has only
$d_v - 1$ direct paths to check nodes from the next level, here also $2Wd_v$ indirect paths through trellis nodes exist
(i.e., variable-trellis-variable-check). Since the edges connected to the trellis nodes are predetermined, the
indirect path also requires the revelation of a single variable-to-check node edge. Assume that $k$ additional edges have
been revealed at this stage without creating a loop. The next revealed edge is chosen among $md_c - k - N_c^{(\ell^*)}$ edges
and it does not create a loop if it is connected to one of the $(m - k - N_c^{(\ell^*)})$ unexplored check nodes. Since each
unexplored check node has $d_c$ edges, the probability of not creating a loop is given by $\frac{(m - k - N_c^{(\ell^*)})\, d_c}{md_c - k - N_c^{(\ell^*)}}$. For
large $n$ we have

\[
\frac{(m - k - N_c^{(\ell^*)})\, d_c}{md_c - k - N_c^{(\ell^*)}} = 1 - \frac{(N_c^{(\ell^*)} + k)(d_c - 1)}{md_c - k - N_c^{(\ell^*)}} \ge 1 - \frac{N_c^{(\ell)}}{m}. \tag{188}
\]

We have $N_c^{(\ell^*+1)} - N_c^{(\ell^*)}$ edges to reveal (one for each check node), therefore, the probability that revealing all variable
node leaves does not create a loop, given that $\mathcal{N}_e^{(\ell^*)}$ is tree-like, is lower bounded by $\left(1 - \frac{N_c^{(\ell)}}{m}\right)^{N_c^{(\ell^*+1)} - N_c^{(\ell^*)}}$. Next, we
reveal the outgoing edges of the check node leaves one at a time (here only $d_c$ direct paths exist, as in the case
without ISI). Assuming $k$ variable nodes have been revealed without creating a loop, the probability that the
next revealed edge does not create a loop is $\frac{(n - k - N_v^{(\ell^*)})\, d_v}{nd_v - k - N_v^{(\ell^*)}}$. For large $n$ we have

\[
\frac{(n - k - N_v^{(\ell^*)})\, d_v}{nd_v - k - N_v^{(\ell^*)}} = 1 - \frac{(N_v^{(\ell^*)} + k)(d_v - 1)}{nd_v - k - N_v^{(\ell^*)}} \ge 1 - \frac{N_v^{(\ell)}}{n}. \tag{189}
\]

We have $N_v^{(\ell^*+1)} - N_v^{(\ell^*)}$ edges to reveal (one for each variable node), therefore, the probability that revealing
all check node leaves does not create a loop, given that the neighborhood is tree-like so far, is lower bounded by
$\left(1 - \frac{N_v^{(\ell)}}{n}\right)^{N_v^{(\ell^*+1)} - N_v^{(\ell^*)}}$. Combining (187), (188), (189) and $P_{\bar{t}}^{(\ell)} = 1 - P_t^{(\ell)}$ we have

\[
P_{\bar{t}}^{(\ell)} \le 1 - \left(1 - \frac{N_v^{(\ell)}}{n}\right)^{N_v^{(\ell)}} \left(1 - \frac{N_c^{(\ell)}}{m}\right)^{N_c^{(\ell)}}.
\]

Thus, for $n$ sufficiently large (and with $m = n d_v / d_c$ check nodes),

\[
P_{\bar{t}}^{(\ell)} \le \left(\bigl(N_v^{(\ell)}\bigr)^2 + \frac{d_c}{d_v}\bigl(N_c^{(\ell)}\bigr)^2\right) \frac{1}{n} = \frac{\gamma}{n}.
\]


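Theorem 12: [Concentration of the number of erroneous variable-to-check messages for channels with ISI]
Let $\mathbf{s}$ be the transmitted codeword. Let $Z^{(\ell)}(\mathbf{s})$ be the number of erroneous variable-to-check messages after $\ell$ rounds
of the windowed message-passing decoding algorithm when the code graph is chosen uniformly at random from
the ensemble of the graphs with variable and check node degree $d_v$ and $d_c$ respectively. Let $p^{(\ell)}(\mathbf{s})$ be the expected
fraction of incorrect messages passed along an edge with a tree-like directed neighborhood of depth $\ell$. Then, there
exist positive constants $\beta(d_v, d_c, \ell) = \frac{d_v^2}{8\left(4d_v\bigl(N_e^{(\ell)}\bigr)^2 + \bigl(N_Y^{(\ell)}\bigr)^2\right)}$ and $\gamma(d_v, d_c, \ell) = \bigl(N_v^{(\ell)}\bigr)^2 + \frac{d_c}{d_v}\bigl(N_c^{(\ell)}\bigr)^2$ such that

[Concentration around expectation] For any $\varepsilon > 0$ we have

\[
\Pr\left\{\left|\frac{Z^{(\ell)}(\mathbf{s})}{nd_v} - \frac{\mathbb{E}[Z^{(\ell)}(\mathbf{s})]}{nd_v}\right| > \varepsilon/2\right\} \le 2e^{-\beta\varepsilon^2 n} \tag{190}
\]

[Convergence to cycle-free case] For any $\varepsilon > 0$ and $n > \frac{2\gamma}{\varepsilon}$ we have

\[
\left|\frac{\mathbb{E}[Z^{(\ell)}(\mathbf{s})]}{nd_v} - p^{(\ell)}(\mathbf{s})\right| \le \varepsilon/2 \tag{191}
\]

[Concentration around cycle-free case] For any $\varepsilon > 0$ and $n > \frac{2\gamma}{\varepsilon}$ we have

\[
\Pr\left\{\left|\frac{Z^{(\ell)}(\mathbf{s})}{nd_v} - p^{(\ell)}(\mathbf{s})\right| > \varepsilon\right\} \le 2e^{-\beta\varepsilon^2 n} \tag{192}
\]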
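Proof: First note that for $n > \frac{2\gamma}{\varepsilon}$ the following inequality holds

\[
\Pr\left\{\left|\frac{Z^{(\ell)}(\mathbf{s})}{nd_v} - p^{(\ell)}(\mathbf{s})\right| > \varepsilon\right\}
\le \Pr\left\{\left|\frac{Z^{(\ell)}(\mathbf{s})}{nd_v} - \frac{\mathbb{E}[Z^{(\ell)}(\mathbf{s})]}{nd_v}\right| > \varepsilon/2\right\}
+ \Pr\left\{\left|\frac{\mathbb{E}[Z^{(\ell)}(\mathbf{s})]}{nd_v} - p^{(\ell)}(\mathbf{s})\right| > \varepsilon/2\right\}. \tag{193}
\]

If inequality (191) holds, then $\Pr\left\{\left|\frac{\mathbb{E}[Z^{(\ell)}(\mathbf{s})]}{nd_v} - p^{(\ell)}(\mathbf{s})\right| > \varepsilon/2\right\} = 0$, therefore using (193) we deduce that (192) follows
from (190) and (191). We start by proving (190). For a deterministic sequence $\mathbf{s}$ the random variable $Z^{(\ell)}(\mathbf{s})$ denotes
the number of incorrect variable-to-check node messages among all $nd_v$ variable-to-check node messages passed in
the $\ell$th iteration for a particular graph $\mathcal{G}$ and decoder's input $\mathbf{Y}$. Let us form a Doob's martingale by first exposing
the $nd_v$ edges of the graph one by one and then exposing the $n$ received values $Y_i$ one by one. For $i = 0, \ldots, n(d_v + 1)$,
define the RV $\tilde{Z}_i \triangleq \mathbb{E}[Z^{(\ell)}(\mathbf{s}) \mid a_1, \ldots, a_i]$, where the sequence $\mathbf{a}$ is the sequence of the $nd_v$ variable-to-check node
edges of the graph followed by the sequence of the $n$ received values. Note that it is a martingale sequence where
$\tilde{Z}_0 = \mathbb{E}[Z^{(\ell)}(\mathbf{s})]$ and $\tilde{Z}_{n(d_v+1)} = Z^{(\ell)}(\mathbf{s})$. We can use Azuma's inequality if we can bound the sequence of differences

\[
|\tilde{Z}_{i+1} - \tilde{Z}_i| \le d_i.
\]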

We now consider the effect of exposing an edge of the graph. Consider two graphs $\mathcal{G}$ and $\tilde{\mathcal{G}}$ whose edges are
identical except for an exchange of the endpoints of two edges. A variable-to-check message is affected by this change
if one (or both) of the edges is in its directed neighborhood of depth $\ell$.

Consider a neighborhood of depth $\ell$ of a variable-to-check node message. Since at each level the graph expands
by a factor $\alpha = (d_v - 1 + 2Wd_v)(d_c - 1)$, there are, in total,

\[
N_e^{(\ell)} = 1 + d_c(d_v - 1 + 2Wd_v) \sum_{i=0}^{\ell-1} \alpha^i
\]

edges related to the code structure (variable-to-check node edges or vice versa) in the neighborhood $\mathcal{N}_e^{(\ell)}$. By
symmetry the two edges can affect at most $2N_e^{(\ell)}$ neighborhoods (alternatively, we could directly sum the number
of variable-to-check node edges in a neighborhood of a variable-to-check node edge and in a neighborhood of a
check-to-variable node edge). The change in the number of incorrect variable-to-check node messages is bounded
by the case that each change in the neighborhood of a message introduces an error. In a similar manner, when we
reveal a received value, the variable-to-check node messages whose directed neighborhood includes that channel
input can be affected. We consider a neighborhood of depth $\ell$ of a received value. By counting, it can be shown
that this neighborhood includes

\[
N_Y^{(\ell)} = d_v(2W + 1) \sum_{i=0}^{\ell-1} \alpha^i
\]

variable-to-check node edges. Therefore a change in a received value can affect up to $N_Y^{(\ell)}$ variable-to-check node
messages. We conclude that $d_i \le 2N_e^{(\ell)}$ for the first $d_v n$ exposures and $d_i \le N_Y^{(\ell)}$ for the last $n$ exposures. By
applying Azuma's inequality we get



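\[
\Pr\left\{\left|\frac{Z^{(\ell)}(\mathbf{s})}{nd_v} - \frac{\mathbb{E}[Z^{(\ell)}(\mathbf{s})]}{nd_v}\right| > \varepsilon/2\right\}
\le 2\exp\left(-\frac{(nd_v\varepsilon/2)^2}{2\left(nd_v\bigl(2N_e^{(\ell)}\bigr)^2 + n\bigl(N_Y^{(\ell)}\bigr)^2\right)}\right).
\]

By comparing the result to (190), we get an expression for $\beta$:

\[
\beta = \frac{d_v^2}{8\left(4d_v\bigl(N_e^{(\ell)}\bigr)^2 + \bigl(N_Y^{(\ell)}\bigr)^2\right)}.
\]

Next, we prove inequality (191); the argument is again adapted from [60] and [36]. Let $\mathbb{E}[Z_i^{(\ell)}(\mathbf{s})]$, $i \in [nd_v]$, be the expected
number of incorrect messages passed along edge $e_i$, where the average is over all graphs and all received values.
Then by linearity of expectation and by symmetry

\[
\mathbb{E}[Z^{(\ell)}(\mathbf{s})] = \sum_{i \in [nd_v]} \mathbb{E}[Z_i^{(\ell)}(\mathbf{s})] = nd_v\, \mathbb{E}[Z_1^{(\ell)}(\mathbf{s})]. \tag{194}
\]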
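Furthermore

\[
\mathbb{E}[Z_1^{(\ell)}(\mathbf{s})] = \mathbb{E}[Z_1^{(\ell)}(\mathbf{s}) \mid \mathcal{N}_e^{(\ell)} \text{ is tree}]\, P_t^{(\ell)} + \mathbb{E}[Z_1^{(\ell)}(\mathbf{s}) \mid \mathcal{N}_e^{(\ell)} \text{ not a tree}]\, P_{\bar{t}}^{(\ell)}.
\]

As shown in Theorem 11, $P_{\bar{t}}^{(\ell)} \le \frac{\gamma}{n}$ where $\gamma$ is a positive constant independent of $n$. Furthermore, we have
$\mathbb{E}[Z_1^{(\ell)}(\mathbf{s}) \mid \text{neighborhood is tree}] = p^{(\ell)}(\mathbf{s})$ and by definition $0 \le \mathbb{E}[Z_1^{(\ell)}(\mathbf{s}) \mid \text{neighborhood not a tree}] \le 1$. Hence

\[
\begin{aligned}
\mathbb{E}[Z_1^{(\ell)}(\mathbf{s})] &\le \bigl(1 - P_{\bar{t}}^{(\ell)}\bigr)\, p^{(\ell)}(\mathbf{s}) + P_{\bar{t}}^{(\ell)} \le p^{(\ell)}(\mathbf{s}) + P_{\bar{t}}^{(\ell)} \\
\mathbb{E}[Z_1^{(\ell)}(\mathbf{s})] &\ge \bigl(1 - P_{\bar{t}}^{(\ell)}\bigr)\, p^{(\ell)}(\mathbf{s}) \ge p^{(\ell)}(\mathbf{s}) - P_{\bar{t}}^{(\ell)}.
\end{aligned} \tag{195}
\]

Using (194), (195) and $P_{\bar{t}}^{(\ell)} \le \frac{\gamma}{n}$ we get that

\[
\left|\frac{\mathbb{E}[Z^{(\ell)}(\mathbf{s})]}{nd_v} - p^{(\ell)}(\mathbf{s})\right| \le P_{\bar{t}}^{(\ell)} \le \frac{\gamma}{n}.
\]

It follows that if $n > \frac{2\gamma}{\varepsilon}$ then (191) holds. ■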
Discussion 2: The concentration result proved above is a generalization of the results given in [60] for the
memoryless case. One can degenerate the expression $\frac{1}{\beta} = 8\left(4d_v\bigl(N_e^{(\ell)}\bigr)^2 + \bigl(N_Y^{(\ell)}\bigr)^2\right) / d_v^2$ to the memoryless case by
setting $W = 0$ and $I = 0$. Since we used exact expressions for $N_e^{(\ell)}$ and $N_Y^{(\ell)}$ in the proof, we can expect a tighter
bound as compared to the earlier result $\frac{1}{\beta_{\mathrm{old}}} = 544\, d_v^{2\ell-1} d_c^{2\ell}$ given in [60]. For example, for $(d_v, d_c, \ell) = (3, 4, 10)$ we
get an improvement by a factor of about 1 million. However, even with this improved expression, the required size
of $n$ according to our proof can be absurdly large. This is because the proof is very pessimistic. We assume that
any change in an edge or the decoder's input will introduce an error in every message it affects. This is especially
pessimistic if large $\ell$ is considered, since as $\ell$ grows each message is a function of many edges and received values
(since the neighborhood grows with $\ell$). However, in practice, the probability that changing a single edge or input
will change the message is close to zero for long codes.
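
The comparison in Discussion 2 can be evaluated numerically. The sketch below assumes the expressions $N_e^{(\ell)} = 1 + d_c(d_v-1)\sum_{i<\ell}\alpha^i$ and $N_Y^{(\ell)} = d_v\sum_{i<\ell}\alpha^i$ with $\alpha = (d_v-1)(d_c-1)$, which is what the general formulas of this section reduce to for $W = 0$, together with the earlier constant $544\, d_v^{2\ell-1} d_c^{2\ell}$; the function names are illustrative.

```python
def inverse_beta_memoryless(dv: int, dc: int, depth: int) -> float:
    """1/beta = 8 (4 dv N_e^2 + N_Y^2) / dv^2, specialized to W = 0."""
    alpha = (dv - 1) * (dc - 1)
    geometric_sum = sum(alpha ** i for i in range(depth))
    n_edges = 1 + dc * (dv - 1) * geometric_sum   # N_e for W = 0
    n_received = dv * geometric_sum               # N_Y for W = 0
    return 8 * (4 * dv * n_edges ** 2 + n_received ** 2) / dv ** 2


def inverse_beta_earlier(dv: int, dc: int, depth: int) -> int:
    """Earlier constant 1/beta = 544 dv^(2l-1) dc^(2l) quoted from [60]."""
    return 544 * dv ** (2 * depth - 1) * dc ** (2 * depth)


if __name__ == "__main__":
    ratio = inverse_beta_earlier(3, 4, 10) / inverse_beta_memoryless(3, 4, 10)
    print(f"improvement factor: {ratio:.3e}")
```

Under these assumed expressions the ratio for $(d_v, d_c, \ell) = (3, 4, 10)$ comes out in the millions, which is the order of magnitude quoted above; both constants remain enormous, in line with the remark that the proof is pessimistic.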

Theorem 13: Let $\mathbf{S}$ be a random sequence of i.u.d. binary variables $S_1, S_2, \ldots, S_n$. Let $Z^{(\ell)}(\mathbf{S})$ be the number of
erroneous variable-to-check messages after $\ell$ rounds of the windowed message-passing decoding algorithm when
the code graph is chosen uniformly at random from the ensemble of the graphs with variable and check node
degree $d_v$ and $d_c$ respectively. Let $p^{(\ell)}_{\mathrm{i.u.d.}} = \mathbb{E}[p^{(\ell)}(\mathbf{S})]$ be the expected fraction of incorrect messages passed along
an edge with a tree-like directed neighborhood of depth $\ell$. Then, there exist positive constants $\beta' = \beta'(d_v, d_c, \ell)$
and $\gamma = \gamma(d_v, d_c, \ell)$ such that for any $\varepsilon > 0$ and $n > \frac{2\gamma}{\varepsilon}$ we have

\[
\Pr\left\{\left|\frac{Z^{(\ell)}(\mathbf{S})}{nd_v} - p^{(\ell)}_{\mathrm{i.u.d.}}\right| > \varepsilon\right\} \le 4e^{-\beta'\varepsilon^2 n}. \tag{196}
\]

Furthermore, $p^{(\ell)}_{\mathrm{i.u.d.}}$ is equal to the error probability when all neighborhood types are equally probable.



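Proof: The proof follows closely the one presented in [36]. First, note that the following chain of inequalities
holds:

\[
\begin{aligned}
\Pr\left\{\left|\frac{Z^{(\ell)}(\mathbf{S})}{nd_v} - p^{(\ell)}_{\mathrm{i.u.d.}}\right| > \varepsilon\right\}
&= \sum_{j=1}^{2^n} 2^{-n} \Pr\left\{\left|\frac{Z^{(\ell)}(\mathbf{s}_j)}{nd_v} - p^{(\ell)}_{\mathrm{i.u.d.}}\right| > \varepsilon\right\} \\
&\le \sum_{j=1}^{2^n} 2^{-n} \Pr\left\{\left|\frac{Z^{(\ell)}(\mathbf{s}_j)}{nd_v} - p^{(\ell)}(\mathbf{s}_j)\right| > \varepsilon/2\right\}
 + \sum_{j=1}^{2^n} 2^{-n} \Pr\left\{\left|p^{(\ell)}(\mathbf{s}_j) - p^{(\ell)}_{\mathrm{i.u.d.}}\right| > \varepsilon/2\right\} \\
&\le \sum_{j=1}^{2^n} 2^{-n} \cdot 2e^{-\beta\varepsilon^2 n/4} + \Pr\left\{\left|p^{(\ell)}(\mathbf{S}) - p^{(\ell)}_{\mathrm{i.u.d.}}\right| > \varepsilon/2\right\} \\
&= 2e^{-\beta\varepsilon^2 n/4} + \Pr\left\{\left|p^{(\ell)}(\mathbf{S}) - p^{(\ell)}_{\mathrm{i.u.d.}}\right| > \varepsilon/2\right\}.
\end{aligned} \tag{197}
\]

To bound the second term in the last line we shall use Azuma's inequality. Let us form a Doob's martingale by
exposing the $n$ symbols $S_1, \ldots, S_n$ one by one. For $t = 1, \ldots, n$, define the RV $M_t = \mathbb{E}[p^{(\ell)}(\mathbf{S}) \mid S_1, S_2, \ldots, S_t]$. Note
that $M_0 = \mathbb{E}[p^{(\ell)}(\mathbf{S})] = p^{(\ell)}_{\mathrm{i.u.d.}}$ and $M_n = \mathbb{E}[p^{(\ell)}(\mathbf{S}) \mid S_1, S_2, \ldots, S_n] = p^{(\ell)}(\mathbf{S})$. In order to use Azuma's inequality we
shall show that the sequence of differences is bounded, $|M_{t+1} - M_t| \le d_t$. Since the channel has ISI of degree
$I$, exposing a single channel input affects $I$ channel outputs (which are the received values for the decoder).
A variable-to-check node message is affected only if one of the affected received values is in its neighborhood.
Therefore, changing a channel input can affect at most $I N_Y^{(\ell)}$ variable-to-check node messages among the $nd_v$
messages in the graph. Thus $|M_{t+1} - M_t| \le \frac{I N_Y^{(\ell)}}{nd_v}$ and by using Azuma's inequality we have

\[
\Pr\left\{\left|p^{(\ell)}(\mathbf{S}) - p^{(\ell)}_{\mathrm{i.u.d.}}\right| > \varepsilon/2\right\} \le 2e^{-\delta\varepsilon^2 n} \tag{198}
\]

where $\delta = \frac{1}{8}\left(\frac{d_v}{I N_Y^{(\ell)}}\right)^2$. Combining (198), (197) and comparing it to (196) gives that $\beta' = \min(\beta, \delta)$.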



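Next, we get an expression for $p^{(\ell)}_{\text{i.u.d.}}$ and show it is equal to the error probability when all neighborhood types
are equally probable. In Fig. 3, a depth 1 message-flow neighborhood is shown. The row of bits "0101" given
above the trellis section represents the binary symbols of the codeword $\underline{S}$ corresponding to the trellis nodes that
influence the message flow. Since the channel has ISI memory of length $I$, there are $2W + I + 1$ binary symbols
that influence the message flow; we call this sequence of bits a neighborhood type. For example, in Fig. 3, the
neighborhood type is $\theta = [0101]$. We expand this definition to a depth-$\ell$ neighborhood by cascading the bits of
each sub-neighborhood of depth 1. Since at each level the graph expands by a factor $\alpha = (d_v - 1 + 2W d_v)(d_c - 1)$,
there are exactly $2^{N(\ell)}$ possible types of message-flow neighborhoods of depth $\ell$, where

$$N(\ell) = (2W + I + 1) \sum_{i=0}^{\ell-1} \alpha^i = (2W + I + 1) \, \frac{\alpha^\ell - 1}{\alpha - 1}.$$

We can now define

$$\pi^{(\ell)}_{\theta} = \Pr(\text{tree delivers incorrect message} \mid \text{tree type } \theta)$$

and

$$P(\theta \mid \underline{s}) = \Pr(\text{tree type } \theta \mid \text{transmitted sequence} = \underline{s}).$$

Therefore we can express $p^{(\ell)}(\underline{s})$ as

$$p^{(\ell)}(\underline{s}) = \sum_{i=1}^{2^{N(\ell)}} \pi^{(\ell)}_{\theta_i} \, P(\theta_i \mid \underline{s}).$$

Next, recognize that if $\underline{S}$ is an i.u.d. sequence, all neighborhood types are equally probable, i.e., $\Pr(\theta \mid \underline{S}) = 2^{-N(\ell)}$.
Using this we have

$$\begin{aligned}
E[p^{(\ell)}(\underline{S})] &= \sum_{j=1}^{2^n} 2^{-n} \, p^{(\ell)}(\underline{s}_j) \\
&= \sum_{j=1}^{2^n} 2^{-n} \sum_{i=1}^{2^{N(\ell)}} \pi^{(\ell)}_{\theta_i} \, P(\theta_i \mid \underline{s}_j) \\
&= \sum_{i=1}^{2^{N(\ell)}} \pi^{(\ell)}_{\theta_i} \sum_{j=1}^{2^n} 2^{-n} \Pr(\theta_i \mid \underline{s}_j) \\
&= \sum_{i=1}^{2^{N(\ell)}} \pi^{(\ell)}_{\theta_i} \Pr(\theta_i \mid \underline{S}) \\
&= \sum_{i=1}^{2^{N(\ell)}} 2^{-N(\ell)} \, \pi^{(\ell)}_{\theta_i}.
\end{aligned}$$

The last term is equal to the error probability when all neighborhood types are equally probable. Since $E[p^{(\ell)}(\underline{S})] =
p^{(\ell)}_{\text{i.u.d.}}$, the theorem is proved.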
H. Expansion of Random Regular Bipartite Graphs

Azuma's inequality is useful for analyzing the expansion of random bipartite graphs. The following theorem was 
introduced in [72, Theorem 25]. It is stated and proved here slightly more precisely, in the sense of characterizing 
the relation between the deviation from the expected value and the exponential convergence rate of the resulting 
probability. 

Theorem 14: [Expansion of random regular bipartite graphs] Let $\mathcal{G}$ be chosen uniformly at random from
the regular ensemble LDPC$(n, x^{l-1}, x^{r-1})$. Let $\alpha \in (0,1)$ and $\delta > 0$ be fixed. Then, with probability at least
$1 - \exp(-\delta n)$, all sets of $\alpha n$ variables in $\mathcal{G}$ have a number of neighbors that is at least

$$n \left[ \frac{l \bigl(1 - (1-\alpha)^r\bigr)}{r} - \sqrt{2 l \alpha \bigl(h(\alpha) + \delta\bigr)} \right] \tag{199}$$

where $h$ designates the binary entropy function to the natural base (i.e., $h(x) = -x \ln(x) - (1-x) \ln(1-x)$ for
$x \in [0,1]$).

Proof: The proof starts by looking at the expected number of neighbors, and then exposing one neighbor at a
time to bound the probability that the number of neighbors deviates significantly from this mean.
Note that the expected number of neighbors of $\alpha n$ variable nodes is equal to

$$\frac{n l \bigl(1 - (1-\alpha)^r\bigr)}{r}$$

since for each of the $\frac{nl}{r}$ check nodes, the probability that it has at least one edge in the subset of $n\alpha$ chosen variable
nodes is $1 - (1-\alpha)^r$. Let us form a martingale sequence to estimate, via Azuma's inequality, the probability that
the actual number of neighbors deviates by a certain amount from this expected value.

Let $\mathcal{V}$ denote the set of $n\alpha$ nodes. This set has $n\alpha l$ outgoing edges. Let us reveal the destination of each of
these edges one at a time. More precisely, let $S_i$ be the RV denoting the check-node socket which the $i$-th edge is
connected to, where $i \in \{1, \ldots, n\alpha l\}$. Let $X(\mathcal{G})$ be a RV which denotes the number of neighbors of a chosen set
of $n\alpha$ variable nodes in a bipartite graph $\mathcal{G}$ from the ensemble, and define for $i = 0, \ldots, n\alpha l$

$$X_i = E[X(\mathcal{G}) \mid S_1, \ldots, S_i].$$

Note that it is a martingale sequence where $X_0 = E[X(\mathcal{G})]$ and $X_{n\alpha l} = X(\mathcal{G})$. Also, for every $i \in \{1, \ldots, n\alpha l\}$,
we have $|X_i - X_{i-1}| \le 1$ since every time only one check-node socket is revealed, so the number of neighbors
of the chosen set of variable nodes cannot change by more than 1 at every single time. Thus, by the one-sided
Azuma's inequality derived in Section III-A,

$$P\bigl( E[X(\mathcal{G})] - X(\mathcal{G}) \ge \lambda \sqrt{l \alpha n} \bigr) \le \exp\left( -\frac{\lambda^2}{2} \right), \quad \forall \lambda > 0.$$

Since there are $\binom{n}{n\alpha}$ choices for the set $\mathcal{V}$ then, from the union bound, the event that there exists a set of size $n\alpha$
whose number of neighbors is less than $E[X(\mathcal{G})] - \lambda \sqrt{l \alpha n}$ occurs with probability that is at most $\binom{n}{n\alpha} \exp\left(-\frac{\lambda^2}{2}\right)$.
Since $\binom{n}{n\alpha} \le e^{n h(\alpha)}$, we get the loosened bound

$$\exp\left( n h(\alpha) - \frac{\lambda^2}{2} \right).$$

Finally, the choice $\lambda = \sqrt{2n \bigl(h(\alpha) + \delta\bigr)}$ gives the required result. ■
Remark 22: It is noted that Theorem 14 uniformly improves the statement in [61, Problem C.4] for every $\delta > 0$.
This holds even in the case where $\alpha \to 1$ (i.e., when considering a set that includes almost all the variable nodes in
the bipartite graph, and whose number of neighbors is expected to be close to $\frac{nl}{r}$). The expression which appears
there, instead of (199), is given by

$$n \left[ \frac{l \bigl(1 - (1-\alpha)^r\bigr)}{r} - \sqrt{2 l \alpha \, h(\alpha)} - \delta \sqrt{\frac{l \alpha}{2 h(\alpha)}} \right],$$

so it tends to $-\infty$ (for every $\delta > 0$) in the case where $\alpha \to 1$, whereas (199) tends nearly to $\frac{nl}{r}$ (for small $\delta > 0$)
as expected.
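The comparison in Remark 22 can be illustrated numerically. The following is a minimal sketch, not part of the original analysis: the function names are hypothetical, both bounds are normalized by $n$, and the second expression is the one attributed above to [61, Problem C.4], read here as $\frac{l(1-(1-\alpha)^r)}{r} - \sqrt{2 l \alpha h(\alpha)} - \delta \sqrt{\frac{l\alpha}{2h(\alpha)}}$ (an assumption about the garbled source).

```python
import math

def binary_entropy(a):
    """Binary entropy to the natural base: h(a) = -a ln a - (1-a) ln(1-a)."""
    if a in (0.0, 1.0):
        return 0.0
    return -a * math.log(a) - (1 - a) * math.log(1 - a)

def expected_fraction(alpha, l, r):
    """Expected number of neighbors divided by n: (l/r) * (1 - (1-alpha)^r)."""
    return (l / r) * (1 - (1 - alpha) ** r)

def bound_199(alpha, delta, l, r):
    """Lower bound (199) on the number of neighbors, divided by n."""
    h = binary_entropy(alpha)
    return expected_fraction(alpha, l, r) - math.sqrt(2 * l * alpha * (h + delta))

def bound_problem_c4(alpha, delta, l, r):
    """Expression quoted in Remark 22 (assumed form), divided by n; needs 0 < alpha < 1."""
    h = binary_entropy(alpha)
    return (expected_fraction(alpha, l, r)
            - math.sqrt(2 * l * alpha * h)
            - delta * math.sqrt(l * alpha / (2 * h)))

if __name__ == "__main__":
    # Illustrative (hypothetical) parameters: a (3,6)-regular ensemble, delta = 0.01.
    for alpha in (0.1, 0.5, 0.9, 0.999):
        print(alpha, bound_199(alpha, 0.01, 3, 6), bound_problem_c4(alpha, 0.01, 3, 6))
```

Since $\sqrt{a+b} \le \sqrt{a} + \frac{b}{2\sqrt{a}}$ for $a > 0$ and $b \ge 0$, the first value is never smaller than the second. Either value can be negative, in which case the corresponding bound is trivial.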
I. Concentration of the Crest-Factor for OFDM Signals

Orthogonal-frequency-division-multiplexing (OFDM) is a modulation that converts a high-rate data stream into
a number of low-rate streams that are transmitted over parallel narrow-band channels. OFDM is widely used in
several international standards for digital audio and video broadcasting, and for wireless local area networks. For
a textbook providing a survey on OFDM, see e.g. [53, Chapter 19]. One of the problems of OFDM signals is
that the peak amplitude of the signal can be significantly higher than the average amplitude. This issue makes the
transmission of OFDM signals sensitive to non-linear devices in the communication path such as digital to analog
converters, mixers and high-power amplifiers. This drawback increases the symbol error rate, and it
also reduces the power efficiency of OFDM signals as compared to single-carrier systems. Commonly, the impact
of nonlinearities is described by the distribution of the crest-factor (CF) of the transmitted signal [43], but its 
calculation involves time-consuming simulations even for a small number of sub-carriers. The expected value of 
the CF for OFDM signals is known to scale like the logarithm of the number of sub-carriers of the OFDM signal 
(see [43], [62, Section 4] and [79]). 

Given an $n$-length codeword $\{X_i\}_{i=0}^{n-1}$, a single OFDM baseband symbol is described by

$$s(t) = \frac{1}{\sqrt{n}} \sum_{i=0}^{n-1} X_i \exp\left( \frac{j \, 2\pi i t}{T} \right), \quad 0 \le t \le T. \tag{200}$$

Let us assume that $X_0, \ldots, X_{n-1}$ are complex RVs, and that a.s. $|X_i| = 1$ (these RVs are not necessarily
independent). Since the sub-carriers are orthonormal over $[0, T]$, the signal power over the interval $[0, T]$ is 1
a.s., i.e.,

$$\frac{1}{T} \int_0^T |s(t)|^2 \, dt = 1. \tag{201}$$

The CF of the signal $s$, composed of $n$ sub-carriers, is defined as

$$\mathrm{CF}_n(s) = \max_{0 \le t \le T} |s(t)|. \tag{202}$$

From [62, Section 4] and [79], it follows that the CF scales with high probability like $\sqrt{\ln n}$ for large $n$. In [43,
Theorem 3 and Corollary 5], a concentration inequality was derived for the CF of OFDM signals. It states that for
an arbitrary $c > 2.5$

$$P\left( \left| \mathrm{CF}_n(s) - \sqrt{\ln n} \right| < \frac{c \ln \ln n}{\sqrt{\ln n}} \right) = 1 - O\left( \frac{1}{(\ln n)^4} \right).$$



Remark 23: The analysis used to derive this rather strong concentration inequality (see [43, Appendix C]) requires
some assumptions on the distribution of the $X_i$'s (see the two conditions in [43, Theorem 3] followed by [43,
Corollary 5]). These requirements are not needed in the following analysis, and the derivations of the concentration
inequalities that are introduced in this subsection are much simpler and provide some insight into the problem,
though they lead to weaker concentration results than in [43, Theorem 3].

In the following, Azuma's inequality and a refined version of this inequality are considered under the assumption
that $\{X_j\}_{j=0}^{n-1}$ are independent complex-valued random variables with magnitude 1, attaining the $M$ points of an
$M$-ary PSK constellation with equal probability. This material was presented in part in [66].

1) Establishing Concentration of the Crest-Factor via Azuma's Inequality: In the following, Azuma's inequality
is used to derive a concentration result. Let us define

$$Y_i = E[\mathrm{CF}_n(s) \mid X_0, \ldots, X_{i-1}], \quad i = 0, \ldots, n. \tag{203}$$

Based on a standard construction of martingales, $\{Y_i, \mathcal{F}_i\}_{i=0}^{n}$ is a martingale where $\mathcal{F}_i$ is the $\sigma$-algebra that is
generated by the first $i$ symbols $(X_0, \ldots, X_{i-1})$ in (200). Hence, $\mathcal{F}_0 \subseteq \mathcal{F}_1 \subseteq \ldots \subseteq \mathcal{F}_n$ is a filtration. This
martingale also has bounded jumps, and

$$|Y_i - Y_{i-1}| \le \frac{2}{\sqrt{n}}$$

for $i \in \{1, \ldots, n\}$ since revealing the additional $i$-th coordinate $X_i$ affects the CF, as defined in (202), by at
most $\frac{2}{\sqrt{n}}$ (see the first part of Appendix J). It therefore follows from Azuma's inequality that, for every $\alpha > 0$,

$$P\bigl( |\mathrm{CF}_n(s) - E[\mathrm{CF}_n(s)]| \ge \alpha \bigr) \le 2 \exp\left( -\frac{\alpha^2}{8} \right) \tag{204}$$
which demonstrates concentration around the expected value.
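As an illustration of (202) and (204), the following minimal sketch draws random $M$-ary PSK codewords and estimates the crest factor empirically. It is an illustration under stated assumptions rather than part of the analysis: $T$ is normalized to 1, the maximum over $t$ is approximated on a uniform grid of $8n$ points (so the computed value can only underestimate the true maximum), the expectation is replaced by a sample mean, and all function names and parameter values are hypothetical.

```python
import cmath
import math
import random

def crest_factor(symbols, oversampling=8):
    """Approximate max_t |s(t)| in (202) on a uniform grid of oversampling*n points (T = 1)."""
    n = len(symbols)
    points = oversampling * n
    peak = 0.0
    for k in range(points):
        t = k / points
        s = sum(x * cmath.exp(2j * math.pi * i * t) for i, x in enumerate(symbols))
        peak = max(peak, abs(s) / math.sqrt(n))
    return peak

def sample_crest_factor(n, M, rng):
    """Crest factor of n independent, equiprobable M-ary PSK symbols."""
    symbols = [cmath.exp(2j * math.pi * rng.randrange(M) / M) for _ in range(n)]
    return crest_factor(symbols)

def azuma_bound(alpha):
    """Right-hand side of (204), clipped at 1."""
    return min(1.0, 2 * math.exp(-alpha ** 2 / 8))

rng = random.Random(0)  # fixed seed, for reproducibility only
samples = [sample_crest_factor(16, 4, rng) for _ in range(200)]
sample_mean = sum(samples) / len(samples)
alpha = 1.0
empirical_tail = sum(abs(c - sample_mean) >= alpha for c in samples) / len(samples)
print(sample_mean, empirical_tail, azuma_bound(alpha))
```

Note that the right-hand side of (204) is below 1 only for $\alpha > \sqrt{8 \ln 2} \approx 2.35$, so for small deviations the bound is trivial and the empirical tail is not informative about its tightness.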

2) Establishing Concentration of the Crest-Factor via the Refined Version of Azuma's Inequality in Proposition 4:
In the following, we rely on Proposition 4 to derive an improved concentration result. For the martingale sequence
$\{Y_i\}_{i=0}^{n}$ in (203), Appendix J gives that a.s.

$$|Y_i - Y_{i-1}| \le \frac{2}{\sqrt{n}}, \qquad E\bigl[ (Y_i - Y_{i-1})^2 \mid \mathcal{F}_{i-1} \bigr] \le \frac{2}{n} \tag{205}$$

for every $i \in \{1, \ldots, n\}$. Note that the conditioning on the $\sigma$-algebra $\mathcal{F}_{i-1}$ is equivalent to the conditioning on the
symbols $X_0, \ldots, X_{i-2}$, and there is no conditioning for $i = 1$. Further, let $Z_i = \sqrt{n} \, Y_i$ for $0 \le i \le n$. Proposition 4
therefore implies that for an arbitrary $\alpha > 0$

$$\begin{aligned}
P\bigl( |\mathrm{CF}_n(s) - E[\mathrm{CF}_n(s)]| \ge \alpha \bigr)
&= P(|Y_n - Y_0| \ge \alpha) \\
&= P(|Z_n - Z_0| \ge \alpha \sqrt{n}) \\
&\le 2 \exp\left( -\frac{\alpha^2}{4} \left( 1 + O\left( \frac{1}{\sqrt{n}} \right) \right) \right)
\end{aligned} \tag{206}$$

(since $\delta = \frac{\alpha}{2}$ and $\gamma = \frac{1}{2}$ in the setting of Proposition 4). Note that the exponent in the last inequality is doubled as
compared to the bound that was obtained in (204) via Azuma's inequality, and the term which scales like $O\left(\frac{1}{\sqrt{n}}\right)$
on the right-hand side of (206) is expressed explicitly for finite $n$ (see Appendix F).

3) A Concentration Inequality via Talagrand's Method: In his seminal paper [74], Talagrand introduced an 
approach for proving concentration inequalities in product spaces. It forms a powerful probabilistic tool for 
establishing concentration results for coordinate-wise Lipschitz functions of independent random variables (see, 
e.g., [21, Section 2.4.2], [50, Section 4] and [74]). This approach is used in the following to derive a concentration 
result of the crest factor around its median, and it also enables the derivation of an upper bound on the distance between
the median and the expected value. We provide in the following definitions that will be required for introducing a 
special form of Talagrand's inequalities. Afterwards, this inequality will be applied to obtain a concentration result 
for the crest factor of OFDM signals. 

Definition 2 (Hamming distance): Let $x, y$ be two $n$-length vectors. The Hamming distance between $x$ and $y$ is
the number of coordinates where $x$ and $y$ disagree, i.e.,

$$d_H(x, y) = \sum_{i=1}^{n} I_{\{x_i \ne y_i\}}$$

where $I$ stands for the indicator function.

The following suggests a generalization and normalization of the previous distance metric.

Definition 3: Let $a = (a_1, \ldots, a_n) \in \mathbb{R}_+^n$ (i.e., $a$ is a non-negative vector) satisfy $\|a\|^2 = \sum_{i=1}^{n} (a_i)^2 = 1$. Then,
define

$$d_a(x, y) = \sum_{i=1}^{n} a_i \, I_{\{x_i \ne y_i\}}.$$

Hence, $d_H(x, y) = \sqrt{n} \, d_a(x, y)$ for $a = \left( \frac{1}{\sqrt{n}}, \ldots, \frac{1}{\sqrt{n}} \right)$.

The following is a special form of Talagrand's inequalities ([50, Chapter 4], [74], [75]). 

Theorem 15 (Talagrand's inequality): Let the random vector $X = (X_1, \ldots, X_n)$ be a vector of independent
random variables with $X_k$ taking values in a set $A_k$, and let $A = \prod_{k=1}^{n} A_k$. Let $f : A \to \mathbb{R}$ satisfy the condition
that, for every $x \in A$, there exists a non-negative, normalized $n$-length vector $a = a(x)$ such that

$$f(x) \le f(y) + \sigma \, d_a(x, y), \quad \forall y \in A \tag{207}$$
for some fixed value $\sigma > 0$. Then, for every $\alpha > 0$,

$$P\bigl( |f(X) - m| \ge \alpha \bigr) \le 4 \exp\left( -\frac{\alpha^2}{4\sigma^2} \right) \tag{208}$$
where $m$ is the median of $f(X)$ (i.e., $P(f(X) \le m) \ge \frac{1}{2}$ and $P(f(X) \ge m) \ge \frac{1}{2}$). The same conclusion in (208)
holds if the condition in (207) is replaced by

$$f(y) \le f(x) + \sigma \, d_a(x, y), \quad \forall y \in A. \tag{209}$$

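At this stage, we are ready to apply Talagrand's inequality to prove a concentration inequality for the crest factor
of OFDM signals. As before, let us assume that $X_0, Y_0, \ldots, X_{n-1}, Y_{n-1}$ are i.i.d. bounded complex RVs, and also
assume for simplicity that $|X_i| = |Y_i| = 1$. In order to apply Talagrand's inequality to prove concentration, note
that

$$\begin{aligned}
\Bigl| \max_{0 \le t \le T} |s(t; X_0, \ldots, X_{n-1})| - \max_{0 \le t \le T} |s(t; Y_0, \ldots, Y_{n-1})| \Bigr|
&\le \max_{0 \le t \le T} \bigl| s(t; X_0, \ldots, X_{n-1}) - s(t; Y_0, \ldots, Y_{n-1}) \bigr| \\
&= \max_{0 \le t \le T} \frac{1}{\sqrt{n}} \left| \sum_{i=0}^{n-1} (X_i - Y_i) \exp\left( \frac{j \, 2\pi i t}{T} \right) \right| \\
&\le \frac{1}{\sqrt{n}} \sum_{i=0}^{n-1} |X_i - Y_i| \\
&\le \frac{2}{\sqrt{n}} \sum_{i=0}^{n-1} I_{\{X_i \ne Y_i\}} \\
&= 2 \, d_a(X, Y)
\end{aligned}$$

where

$$a \triangleq \left( \frac{1}{\sqrt{n}}, \ldots, \frac{1}{\sqrt{n}} \right) \tag{210}$$

is a non-negative unit-vector of length $n$ (note that $a$ in this case is independent of $x$). Hence, Talagrand's inequality
in Theorem 15 (with $\sigma = 2$) implies that, for every $\alpha > 0$,

$$P\bigl( |\mathrm{CF}_n(s) - m_n| \ge \alpha \bigr) \le 4 \exp\left( -\frac{\alpha^2}{16} \right) \tag{211}$$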

where m n is the median of the crest factor for OFDM signals that are composed of n sub-carriers. This inequality 
demonstrates the concentration of this measure around its median. As a simple consequence of (211), one obtains 
the following result. 

Corollary 8: The median and expected value of the crest factor differ by at most a constant, independently of 
the number of sub-carriers n. 

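Proof: By the concentration inequality in (211),

$$\begin{aligned}
\bigl| E[\mathrm{CF}_n(s)] - m_n \bigr| &\le E\bigl| \mathrm{CF}_n(s) - m_n \bigr| \\
&= \int_0^{\infty} P\bigl( |\mathrm{CF}_n(s) - m_n| \ge \alpha \bigr) \, d\alpha \\
&\le \int_0^{\infty} 4 \exp\left( -\frac{\alpha^2}{16} \right) d\alpha \\
&= 8 \sqrt{\pi}.
\end{aligned}$$

Remark 24: This result applies in general to an arbitrary function $f$ satisfying the condition in (207), where
Talagrand's inequality in (208) implies that (see, e.g., [50, Lemma 4.6])

$$\bigl| E[f(X)] - m \bigr| \le 4 \sigma \sqrt{\pi}.$$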
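The constant in Corollary 8 and Remark 24 comes from integrating the Talagrand tail bound $4\exp\left(-\frac{\alpha^2}{4\sigma^2}\right)$ over $\alpha \in [0, \infty)$, which gives $4\sigma\sqrt{\pi}$ (and $8\sqrt{\pi}$ for $\sigma = 2$). The following minimal sketch checks this value by a trapezoidal rule; the function name, the truncation point and the step count are illustrative assumptions.

```python
import math

def tail_integral(sigma, upper=200.0, steps=200_000):
    """Trapezoidal approximation of the integral of 4*exp(-a^2 / (4 sigma^2)) over [0, upper]."""
    h = upper / steps

    def f(a):
        return 4 * math.exp(-a * a / (4 * sigma * sigma))

    total = 0.5 * (f(0.0) + f(upper))
    for k in range(1, steps):
        total += f(k * h)
    return total * h

if __name__ == "__main__":
    print(tail_integral(2.0), 8 * math.sqrt(math.pi))
```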
4) Establishing Concentration via McDiarmid's Inequality: McDiarmid's inequality (see Theorem 2) is applied
in the following to prove a concentration inequality for the crest factor of OFDM signals. To this end, let us define

$$\begin{aligned}
U &\triangleq \max_{0 \le t \le T} \bigl| s(t; X_0, \ldots, X_{i-1}, X_i, \ldots, X_{n-1}) \bigr| \\
V &\triangleq \max_{0 \le t \le T} \bigl| s(t; X_0, \ldots, X'_{i-1}, X_i, \ldots, X_{n-1}) \bigr|
\end{aligned}$$

where the two vectors $(X_0, \ldots, X_{i-1}, X_i, \ldots, X_{n-1})$ and $(X_0, \ldots, X'_{i-1}, X_i, \ldots, X_{n-1})$ may only differ in their
$i$-th coordinate. This then implies that

$$\begin{aligned}
|U - V| &\le \max_{0 \le t \le T} \bigl| s(t; X_0, \ldots, X_{i-1}, X_i, \ldots, X_{n-1}) - s(t; X_0, \ldots, X'_{i-1}, X_i, \ldots, X_{n-1}) \bigr| \\
&= \max_{0 \le t \le T} \frac{1}{\sqrt{n}} \left| \bigl( X_{i-1} - X'_{i-1} \bigr) \exp\left( \frac{j \, 2\pi (i-1) t}{T} \right) \right| \\
&= \frac{|X_{i-1} - X'_{i-1}|}{\sqrt{n}} \le \frac{2}{\sqrt{n}}
\end{aligned}$$

where the last inequality holds since $|X_{i-1}| = |X'_{i-1}| = 1$. Hence, McDiarmid's inequality in Theorem 2 implies
that, for every $\alpha > 0$,

$$P\bigl( |\mathrm{CF}_n(s) - E[\mathrm{CF}_n(s)]| \ge \alpha \bigr) \le 2 \exp\left( -\frac{\alpha^2}{2} \right) \tag{212}$$

which demonstrates concentration of this measure around its expected value. By comparing (211) with (212), it
follows that McDiarmid's inequality provides an improvement in the exponent. The improvement of McDiarmid's
inequality is by a factor of 4 in the exponent as compared to Azuma's inequality, and by a factor of 2 as compared
to the refined version of Azuma's inequality in Proposition 4.

5) Summary: This subsection derives four concentration inequalities for the crest-factor (CF) of OFDM signals 
under the assumption that the symbols are independent. The first two concentration inequalities rely on Azuma's 
inequality and a refined version of it, and the last two concentration inequalities are based on Talagrand's and 
McDiarmid's inequalities. Although these concentration results are weaker than some existing results from the 
literature (see [43] and [79]), they establish concentration in a rather simple way and provide some insight to the 
problem. The use of these bounding techniques, in the context of concentration for OFDM signals, seems to be 
new. McDiarmid's inequality improves the exponent of Azuma's inequality by a factor of 4, and the exponent of 
the refined version of Azuma's inequality from Proposition 4 by a factor of 2. Note however that Proposition 4 
may be in general tighter than McDiarmid's inequality (if $\gamma < \frac{1}{4}$ in the setting of Proposition 4). It also follows
from Talagrand's method that the median and expected value of the CF differ by at most a constant, independently 
of the number of sub-carriers. 
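To make the comparison in this summary concrete, the following minimal sketch evaluates the four tail bounds at a few deviations $\alpha$. The function names are hypothetical. The refined Azuma bound is evaluated in its asymptotic form, dropping the $1 + O(1/\sqrt{n})$ factor in (206), and Talagrand's bound (211) is a deviation from the median rather than from the mean, so it is not directly comparable with the other three.

```python
import math

def azuma(alpha):
    """Right-hand side of (204)."""
    return 2 * math.exp(-alpha ** 2 / 8)

def refined_azuma_asymptotic(alpha):
    """Right-hand side of (206) in the limit of large n (assumption: O(1/sqrt(n)) term dropped)."""
    return 2 * math.exp(-alpha ** 2 / 4)

def talagrand(alpha):
    """Right-hand side of (211); deviation is measured from the median."""
    return 4 * math.exp(-alpha ** 2 / 16)

def mcdiarmid(alpha):
    """Right-hand side of (212)."""
    return 2 * math.exp(-alpha ** 2 / 2)

if __name__ == "__main__":
    for alpha in (1.0, 2.0, 4.0):
        print(alpha, azuma(alpha), refined_azuma_asymptotic(alpha),
              talagrand(alpha), mcdiarmid(alpha))
```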



J. Random Coding Theorems via Martingale Inequalities

The following sub-section establishes new error exponents and achievable rates of random coding, for channels 
with and without memory, under maximum-likelihood (ML) decoding. The analysis relies on some exponential 
inequalities for martingales with bounded jumps. The characteristics of these coding theorems are exemplified in 
special cases of interest that include non-linear channels. The material in this sub-section is based on [80], [81] 
and [82] (and mainly on the latest improvements of these achievable rates in [82]). 

Random coding theorems address the average error probability of an ensemble of codebooks as a function of the 
code rate R, the block length N, and the channel statistics. It is assumed that the codewords are chosen randomly, 
subject to some possible constraints, and the codebook is known to the encoder and decoder. 

Nonlinear effects are typically encountered in wireless communication systems and optical fibers, which degrade 
the quality of the information transmission. In satellite communication systems, the amplifiers located on board 
satellites typically operate at or near the saturation region in order to conserve energy. Saturation nonlinearities of 
amplifiers introduce nonlinear distortion in the transmitted signals. Similarly, power amplifiers in mobile terminals
are designed to operate in a nonlinear region in order to obtain high power efficiency in mobile cellular communications.
Gigabit optical fiber communication channels typically exhibit linear and nonlinear distortion as a result
of non-ideal transmitter, fiber, receiver and optical amplifier components. Nonlinear communication channels can
be represented by Volterra models [9, Chapter 14].

Significant degradation in performance may result in the mismatched regime. However, in the following, it is 
assumed that both the transmitter and the receiver know the exact probability law of the channel. 

We start the presentation by writing explicitly the martingale inequalities that we rely on, derived earlier along 
the derivation of the concentration inequalities in this chapter. 

1) Martingale inequalities: 

• The first martingale inequality is a known result (see [21, Corollary 2.4.7] and [49]) that will be useful later 
in this paper. 

Theorem 16: Let $\{X_k, \mathcal{F}_k\}_{k=0}^{n}$, for some $n \in \mathbb{N}$, be a discrete-parameter, real-valued martingale with bounded
jumps. Let

$$\xi_k \triangleq X_k - X_{k-1}, \quad \forall k \in \{1, \ldots, n\}$$
designate the jumps of the martingale. Assume that, for some constants $d, \sigma > 0$, the following two requirements

$$\xi_k \le d, \qquad \mathrm{Var}(\xi_k \mid \mathcal{F}_{k-1}) \le \sigma^2$$
hold almost surely (a.s.) for every $k \in \{1, \ldots, n\}$. Let $\gamma \triangleq \frac{\sigma^2}{d^2}$. Then, for every $t \ge 0$,

$$E\left[ \exp\left( t \sum_{k=1}^{n} \xi_k \right) \right] \le \left( \frac{e^{-\gamma t d} + \gamma e^{t d}}{1 + \gamma} \right)^n. \tag{213}$$

The proof of this theorem relies on Bennett's inequality (see [10] and [21, Lemma 2.4.1]), and it was presented
earlier in this chapter for the derivation of the first refinement of the Azuma-Hoeffding inequality.
• Second inequality: The following theorem presents a new martingale inequality that will be useful later in this 
sub-section. 

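Theorem 17: Let $\{X_k, \mathcal{F}_k\}_{k=0}^{n}$, for some $n \in \mathbb{N}$, be a discrete-time, real-valued martingale with bounded
jumps. Let

$$\xi_k \triangleq X_k - X_{k-1}, \quad \forall k \in \{1, \ldots, n\}$$

and let $m \in \mathbb{N}$ be an even number, $d > 0$ be a positive number, and $\{\mu_l\}_{l=2}^{m}$ be a sequence of numbers such
that

$$\xi_k \le d, \tag{214}$$
$$E\bigl[ (\xi_k)^l \mid \mathcal{F}_{k-1} \bigr] \le \mu_l, \quad \forall l \in \{2, \ldots, m\} \tag{215}$$

holds a.s. for every $k \in \{1, \ldots, n\}$. Furthermore, let

$$\gamma_l \triangleq \frac{\mu_l}{d^l}, \quad \forall l \in \{2, \ldots, m\}. \tag{216}$$

Then, for every $t \ge 0$,

$$E\left[ \exp\left( t \sum_{k=1}^{n} \xi_k \right) \right] \le \left( 1 + \sum_{l=2}^{m-1} \frac{(\gamma_l - \gamma_m)(td)^l}{l!} + \gamma_m \bigl( e^{td} - 1 - td \bigr) \right)^n. \tag{217}$$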
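The two bounds (213) and (217) can be compared on a simple case where the left-hand side is known exactly. The following minimal sketch assumes i.i.d. jumps $\xi_k = \pm d$ with probability $\frac{1}{2}$ each (a hypothetical test case, not taken from the text), for which $E[\exp(t \sum_k \xi_k)] = \cosh^n(td)$, $\gamma = 1$ in Theorem 16, and $\gamma_l = \frac{1 + (-1)^l}{2}$ in Theorem 17. The function names are illustrative.

```python
import math

def mgf_symmetric(t, d, n):
    """Exact E[exp(t * sum_k xi_k)] for n i.i.d. jumps equal to +d or -d with probability 1/2."""
    return math.cosh(t * d) ** n

def bound_theorem_16(t, d, gamma, n):
    """Right-hand side of (213)."""
    return ((math.exp(-gamma * t * d) + gamma * math.exp(t * d)) / (1 + gamma)) ** n

def bound_theorem_17(t, d, gammas, n):
    """Right-hand side of (217); gammas maps l -> gamma_l for l = 2, ..., m (m even)."""
    m = max(gammas)
    x = t * d
    base = 1.0 + gammas[m] * (math.exp(x) - 1 - x)
    for l in range(2, m):
        base += (gammas[l] - gammas[m]) * x ** l / math.factorial(l)
    return base ** n

def symmetric_gammas(m):
    """gamma_l for the symmetric two-point jump: 1 for even l, 0 for odd l."""
    return {l: (1 + (-1) ** l) / 2 for l in range(2, m + 1)}

if __name__ == "__main__":
    t, d, n = 0.3, 2.0, 5
    print(mgf_symmetric(t, d, n), bound_theorem_16(t, d, 1.0, n),
          bound_theorem_17(t, d, symmetric_gammas(4), n),
          bound_theorem_17(t, d, symmetric_gammas(10), n))
```

In this case (213) holds with equality, while (217) approaches the exact value from above as the even number $m$ grows.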
2) Achievable Rates under ML Decoding: The goal of this sub-section is to derive achievable rates in the random
coding setting under ML decoding. We first review briefly the analysis in [81] for the derivation of the upper bound 
on the ML decoding error probability. This review is necessary in order to make the beginning of the derivation 
of this bound more accurate, and to correct along the way some inaccuracies that appear in [81, Section II]. After 
the first stage of this analysis, we proceed by improving the resulting error exponents and their corresponding 
achievable rates via the application of the martingale inequalities in the previous sub-section. 

Consider an ensemble of block codes $\mathcal{C}$ of length $N$ and rate $R$. Let $C \in \mathcal{C}$ be a codebook in the ensemble. The
number of codewords in $C$ is $M = \lceil \exp(NR) \rceil$. The codewords of a codebook $C$ are assumed to be independent,
and the symbols in each codeword are assumed to be i.i.d. with an arbitrary probability distribution $P$. An ML
decoding error occurs if, given the transmitted message $m$ and the received vector $y$, there exists another message
$m' \ne m$ such that

$$\|y - D u_{m'}\|_2 \le \|y - D u_m\|_2.$$
The union bound for an AWGN channel implies that

$$P_{e|m}(C) \le \sum_{m' \ne m} Q\left( \frac{\|D u_m - D u_{m'}\|_2}{2 \sigma_\nu} \right)$$
where

$$Q(x) \triangleq \frac{1}{\sqrt{2\pi}} \int_x^{\infty} \exp\left( -\frac{t^2}{2} \right) dt, \quad \forall x \in \mathbb{R} \tag{218}$$

is the complementary Gaussian cumulative distribution function. By using the inequality $Q(x) \le \frac{1}{2} \exp\left(-\frac{x^2}{2}\right)$ for
$x \ge 0$, it gives the loosened bound (by also ignoring the factor of one-half in the bound of $Q$)

$$P_{e|m}(C) \le \sum_{m' \ne m} \exp\left( -\frac{\|D u_m - D u_{m'}\|_2^2}{8 \sigma_\nu^2} \right).$$

At this stage, let us introduce a new parameter $\rho \in [0, 1]$, and write

$$P_{e|m}(C) \le \sum_{m' \ne m} \exp\left( -\frac{\rho \, \|D u_m - D u_{m'}\|_2^2}{8 \sigma_\nu^2} \right).$$

Note that at this stage, the introduction of the additional parameter $\rho$ is useless as its optimal value is $\rho_{\text{opt}} = 1$.
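The average ML decoding error probability over the code ensemble therefore satisfies

$$\overline{P}_{e|m} \le E\left[ \sum_{m' \ne m} \exp\left( -\frac{\rho \, \|D u_m - D u_{m'}\|_2^2}{8 \sigma_\nu^2} \right) \right]$$

and the average ML decoding error probability over the code ensemble and the transmitted message satisfies

$$\overline{P}_e \le (M - 1) \, E\left[ \exp\left( -\frac{\rho \, \|D u - D \tilde{u}\|_2^2}{8 \sigma_\nu^2} \right) \right] \tag{219}$$

where the expectation is taken over two randomly chosen codewords $u$ and $\tilde{u}$, where these codewords are independent,
and their symbols are i.i.d. with a probability distribution $P$.

Consider a filtration $\mathcal{F}_0 \subseteq \mathcal{F}_1 \subseteq \ldots \subseteq \mathcal{F}_N$ where the sub $\sigma$-algebra $\mathcal{F}_i$ is given by

$$\mathcal{F}_i \triangleq \sigma(U_1, \tilde{U}_1, \ldots, U_i, \tilde{U}_i), \quad \forall i \in \{1, \ldots, N\} \tag{220}$$

for two randomly selected codewords $u = (u_1, \ldots, u_N)$ and $\tilde{u} = (\tilde{u}_1, \ldots, \tilde{u}_N)$ from the codebook; $\mathcal{F}_i$ is the
minimal $\sigma$-algebra that is generated by the first $i$ coordinates of these two codewords. In particular, let $\mathcal{F}_0 \triangleq \{\emptyset, \Omega\}$
be the trivial $\sigma$-algebra. Furthermore, define the discrete-time martingale $\{X_k, \mathcal{F}_k\}_{k=0}^{N}$ by

$$X_k = E\bigl[ \|D u - D \tilde{u}\|_2^2 \mid \mathcal{F}_k \bigr]. \tag{221}$$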
Here, $X_k$ designates the conditional expectation of the squared Euclidean distance between the distorted codewords $Du$ and
$D\tilde{u}$ given the first $k$ coordinates of the two codewords $u$ and $\tilde{u}$. The first and last elements of this martingale
sequence are, respectively, equal to

$$X_0 = E\bigl[ \|D u - D \tilde{u}\|_2^2 \bigr], \qquad X_N = \|D u - D \tilde{u}\|_2^2. \tag{222}$$

Furthermore, following earlier notation, let $\xi_k = X_k - X_{k-1}$ be the jumps of the martingale; then

$$\sum_{k=1}^{N} \xi_k = X_N - X_0 = \|D u - D \tilde{u}\|_2^2 - E\bigl[ \|D u - D \tilde{u}\|_2^2 \bigr]$$

and the substitution of the last equality into (219) gives that

$$\overline{P}_e \le \exp(NR) \, \exp\left( -\frac{\rho \, E\bigl[ \|D u - D \tilde{u}\|_2^2 \bigr]}{8 \sigma_\nu^2} \right) E\left[ \exp\left( -\frac{\rho}{8 \sigma_\nu^2} \sum_{k=1}^{N} \xi_k \right) \right]. \tag{223}$$



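Since the codewords are independent and their symbols are i.i.d., it follows that

$$\begin{aligned}
E\|D u - D \tilde{u}\|_2^2 &= \sum_{k=1}^{N} E\Bigl[ \bigl( [Du]_k - [D\tilde{u}]_k \bigr)^2 \Bigr] \\
&= \sum_{k=1}^{N} \mathrm{Var}\bigl( [Du]_k - [D\tilde{u}]_k \bigr) \\
&= 2 \sum_{k=1}^{N} \mathrm{Var}\bigl( [Du]_k \bigr) \\
&= 2 \left( \sum_{k=1}^{q-1} \mathrm{Var}\bigl( [Du]_k \bigr) + \sum_{k=q}^{N} \mathrm{Var}\bigl( [Du]_k \bigr) \right).
\end{aligned}$$

Due to the channel model (see Eq. (240)) and the assumption that the symbols $\{u_i\}$ are i.i.d., it follows that
$\mathrm{Var}([Du]_k)$ is fixed for $k = q, \ldots, N$. Let $D_v(P)$ designate this common value of the variance (i.e., $D_v(P) =
\mathrm{Var}([Du]_k)$ for $k \ge q$), then

$$E\|D u - D \tilde{u}\|_2^2 = 2 \left( \sum_{k=1}^{q-1} \mathrm{Var}\bigl( [Du]_k \bigr) + (N - q + 1) D_v(P) \right).$$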



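Let

$$C_\rho(P) \triangleq \exp\left\{ -\frac{\rho}{4 \sigma_\nu^2} \left[ \sum_{k=1}^{q-1} \mathrm{Var}\bigl( [Du]_k \bigr) - (q-1) D_v(P) \right] \right\}$$

which is a bounded constant, under the assumption that $\|u\|_\infty \le K < +\infty$ holds a.s. for some $K > 0$, and it is
independent of the block length $N$. This therefore implies that the ML decoding error probability satisfies

$$\overline{P}_e \le C_\rho(P) \, \exp\left\{ -N \left[ \frac{\rho D_v(P)}{4 \sigma_\nu^2} - R \right] \right\} E\left[ \exp\left( \frac{\rho}{8 \sigma_\nu^2} \sum_{k=1}^{N} Z_k \right) \right], \quad \forall \rho \in [0, 1] \tag{224}$$

where $Z_k \triangleq -\xi_k$, so $\{Z_k, \mathcal{F}_k\}$ is a martingale-difference sequence that corresponds to the jumps of the martingale $\{-X_k, \mathcal{F}_k\}$.
From (221), it follows that the martingale-difference sequence $\{Z_k, \mathcal{F}_k\}$ is given by

$$Z_k = X_{k-1} - X_k = E\bigl[ \|D u - D \tilde{u}\|_2^2 \mid \mathcal{F}_{k-1} \bigr] - E\bigl[ \|D u - D \tilde{u}\|_2^2 \mid \mathcal{F}_k \bigr]. \tag{225}$$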



For the derivation of improved achievable rates and error exponents (as compared to [81]), the two martingale 
inequalities presented earlier in this sub-section are applied to obtain two possible exponential upper bounds
(in terms of N) on the last term on the right-hand side of (224). 
Let us assume that the essential supremum of the channel input is finite a.s. (i.e., $\|u\|_\infty$ is bounded a.s.). Based
on the upper bound on the ML decoding error probability in (224), combined with the exponential martingale
inequalities that are introduced in Theorems 16 and 17, one obtains the following bounds:

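1) First Bounding Technique: From Theorem 16, if

$$Z_k \le d, \qquad \mathrm{Var}(Z_k \mid \mathcal{F}_{k-1}) \le \sigma^2$$

holds a.s. for every $k \ge 1$, and $\gamma_2 \triangleq \frac{\sigma^2}{d^2}$, then it follows from (224) that for every $\rho \in [0, 1]$

$$\overline{P}_e \le C_\rho(P) \, \exp\left\{ -N \left[ \frac{\rho D_v(P)}{4 \sigma_\nu^2} - R \right] \right\} \left( \frac{\exp\left( -\frac{\rho \gamma_2 d}{8 \sigma_\nu^2} \right) + \gamma_2 \exp\left( \frac{\rho d}{8 \sigma_\nu^2} \right)}{1 + \gamma_2} \right)^N.$$

Therefore, the maximal achievable rate that follows from this bound is given by

$$R_1(\sigma_\nu^2) \triangleq \max_{P} \max_{\rho \in [0,1]} \left\{ \frac{\rho D_v(P)}{4 \sigma_\nu^2} - \ln\left( \frac{\exp\left( -\frac{\rho \gamma_2 d}{8 \sigma_\nu^2} \right) + \gamma_2 \exp\left( \frac{\rho d}{8 \sigma_\nu^2} \right)}{1 + \gamma_2} \right) \right\} \tag{226}$$

where the double maximization is performed over the input distribution $P$ and the parameter $\rho \in [0, 1]$. The
inner maximization in (226) can be expressed in closed form, leading to the following simplified expression:

$$R_1(\sigma_\nu^2) = \max_{P} \begin{cases}
D\left( \dfrac{\gamma_2}{1 + \gamma_2} + \dfrac{2 D_v(P)}{d (1 + \gamma_2)} \,\Big\|\, \dfrac{\gamma_2}{1 + \gamma_2} \right), & \text{if } D_v(P) < \dfrac{\gamma_2 d \left( \exp\left( \frac{d (1 + \gamma_2)}{8 \sigma_\nu^2} \right) - 1 \right)}{2 \left( 1 + \gamma_2 \exp\left( \frac{d (1 + \gamma_2)}{8 \sigma_\nu^2} \right) \right)} \\[3ex]
\dfrac{D_v(P)}{4 \sigma_\nu^2} - \ln\left( \dfrac{\exp\left( -\frac{\gamma_2 d}{8 \sigma_\nu^2} \right) + \gamma_2 \exp\left( \frac{d}{8 \sigma_\nu^2} \right)}{1 + \gamma_2} \right), & \text{otherwise}
\end{cases} \tag{227}$$

where

$$D(p \| q) \triangleq p \ln\left( \frac{p}{q} \right) + (1 - p) \ln\left( \frac{1 - p}{1 - q} \right), \quad \forall p, q \in (0, 1) \tag{228}$$

denotes the Kullback-Leibler distance (a.k.a. divergence or relative entropy) between the two probability
distributions $(p, 1 - p)$ and $(q, 1 - q)$.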

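2) Second Bounding Technique: Based on the combination of Theorem 17 and Eq. (224), we derive in the
following a second achievable rate for random coding under ML decoding. Referring to the martingale-difference
sequence $\{Z_k, \mathcal{F}_k\}_{k=1}^{N}$ in Eqs. (220) and (225), one obtains from Eq. (224) that if for some even
number $m \in \mathbb{N}$

$$Z_k \le d, \qquad E\bigl[ (Z_k)^l \mid \mathcal{F}_{k-1} \bigr] \le \mu_l, \quad \forall l \in \{2, \ldots, m\}$$
hold a.s. for some positive constant $d > 0$ and a sequence $\{\mu_l\}_{l=2}^{m}$, and

$$\gamma_l \triangleq \frac{\mu_l}{d^l}, \quad \forall l \in \{2, \ldots, m\},$$

then the average error probability satisfies, for every $\rho \in [0, 1]$,

$$\overline{P}_e \le C_\rho(P) \, \exp\left\{ -N \left[ \frac{\rho D_v(P)}{4 \sigma_\nu^2} - R \right] \right\} \left[ 1 + \sum_{l=2}^{m-1} \frac{\gamma_l - \gamma_m}{l!} \left( \frac{\rho d}{8 \sigma_\nu^2} \right)^l + \gamma_m \left( \exp\left( \frac{\rho d}{8 \sigma_\nu^2} \right) - 1 - \frac{\rho d}{8 \sigma_\nu^2} \right) \right]^N.$$

This gives the following achievable rate, for an arbitrary even number $m \in \mathbb{N}$,

$$R_2(\sigma_\nu^2) \triangleq \max_{P} \max_{\rho \in [0,1]} \left\{ \frac{\rho D_v(P)}{4 \sigma_\nu^2} - \ln\left( 1 + \sum_{l=2}^{m-1} \frac{\gamma_l - \gamma_m}{l!} \left( \frac{\rho d}{8 \sigma_\nu^2} \right)^l + \gamma_m \left( \exp\left( \frac{\rho d}{8 \sigma_\nu^2} \right) - 1 - \frac{\rho d}{8 \sigma_\nu^2} \right) \right) \right\} \tag{229}$$

where, similarly to (226), the double maximization in (229) is performed over the input distribution $P$ and
the parameter $\rho \in [0, 1]$.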
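The closed form (227) of the inner maximization over $\rho$ in (226) can be checked numerically for a fixed input distribution. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the numeric maximum is taken on a uniform grid of $\rho \in [0, 1]$, and the values of $D_v(P)$, $d$, $\gamma_2$ and $\sigma_\nu^2$ passed in are illustrative numbers rather than values from the text.

```python
import math

def kl_binary(p, q):
    """Binary Kullback-Leibler divergence D(p||q) in (228), in nats, for p, q in (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def objective_226(rho, Dv, d, gamma2, noise_var):
    """Expression maximized over rho in (226) for a fixed input distribution."""
    x = rho * d / (8 * noise_var)
    return rho * Dv / (4 * noise_var) - math.log(
        (math.exp(-gamma2 * x) + gamma2 * math.exp(x)) / (1 + gamma2))

def inner_max_numeric(Dv, d, gamma2, noise_var, steps=100_000):
    """Grid search over rho in [0, 1]."""
    return max(objective_226(k / steps, Dv, d, gamma2, noise_var)
               for k in range(steps + 1))

def inner_max_closed_form(Dv, d, gamma2, noise_var):
    """Closed form (227) for a fixed input distribution."""
    e = math.exp(d * (1 + gamma2) / (8 * noise_var))
    threshold = gamma2 * d * (e - 1) / (2 * (1 + gamma2 * e))
    if Dv < threshold:
        q = gamma2 / (1 + gamma2)
        return kl_binary(q + 2 * Dv / (d * (1 + gamma2)), q)
    return objective_226(1.0, Dv, d, gamma2, noise_var)

if __name__ == "__main__":
    for args in ((0.5, 2.0, 1.0, 0.1), (1.0, 2.0, 1.0, 1.0)):
        print(args, inner_max_numeric(*args), inner_max_closed_form(*args))
```

The first parameter set falls in the first case of (227), where the optimal $\rho$ is interior, and the second set falls in the second case, where $\rho = 1$ is optimal.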
3) Achievable Rates for Random Coding: In the following, the achievable rates for random coding over various
linear and non-linear channels (with and without memory) are exemplified. In order to assess the tightness of the 
bounds, we start with a simple example where the mutual information for the given input distribution is known, so 
that its gap can be estimated (since we use here the union bound, it would have been in place also to compare the 
achievable rate with the cutoff rate). 

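1) Binary-Input AWGN Channel: Consider the case of a binary-input AWGN channel where

$$Y_k = U_k + \nu_k$$

where $U_i = \pm A$ for some constant $A > 0$ is a binary input, and $\nu_i \sim \mathcal{N}(0, \sigma_\nu^2)$ is an additive Gaussian
noise with zero mean and variance $\sigma_\nu^2$. Since the codewords $U = (U_1, \ldots, U_N)$ and $\tilde{U} = (\tilde{U}_1, \ldots, \tilde{U}_N)$ are
independent and their symbols are i.i.d., let

$$P(U_k = A) = P(\tilde{U}_k = A) = \alpha, \qquad P(U_k = -A) = P(\tilde{U}_k = -A) = 1 - \alpha$$

for some $\alpha \in [0, 1]$. Since the channel is memoryless and all the symbols are i.i.d., one gets from
(220) and (225) that

$$\begin{aligned}
Z_k &= E\bigl[ \|U - \tilde{U}\|_2^2 \mid \mathcal{F}_{k-1} \bigr] - E\bigl[ \|U - \tilde{U}\|_2^2 \mid \mathcal{F}_k \bigr] \\
&= \left\{ \sum_{j=1}^{k-1} (U_j - \tilde{U}_j)^2 + \sum_{j=k}^{N} E\bigl[ (U_j - \tilde{U}_j)^2 \bigr] \right\} - \left\{ \sum_{j=1}^{k} (U_j - \tilde{U}_j)^2 + \sum_{j=k+1}^{N} E\bigl[ (U_j - \tilde{U}_j)^2 \bigr] \right\} \\
&= E\bigl[ (U_k - \tilde{U}_k)^2 \bigr] - (U_k - \tilde{U}_k)^2 \\
&= \alpha (1 - \alpha)(-2A)^2 + \alpha (1 - \alpha)(2A)^2 - (U_k - \tilde{U}_k)^2 \\
&= 8 \alpha (1 - \alpha) A^2 - (U_k - \tilde{U}_k)^2.
\end{aligned}$$

Hence, for every $k$,

$$Z_k \le 8 \alpha (1 - \alpha) A^2 \triangleq d. \tag{230}$$

Furthermore, for every $k, l \in \mathbb{N}$, due to the above properties

$$\begin{aligned}
E\bigl[ (Z_k)^l \mid \mathcal{F}_{k-1} \bigr] &= E\bigl[ (Z_k)^l \bigr] \\
&= E\Bigl[ \bigl( 8 \alpha (1 - \alpha) A^2 - (U_k - \tilde{U}_k)^2 \bigr)^l \Bigr] \\
&= \bigl[ 1 - 2\alpha(1 - \alpha) \bigr] \bigl( 8 \alpha (1 - \alpha) A^2 \bigr)^l + 2 \alpha (1 - \alpha) \bigl( 8 \alpha (1 - \alpha) A^2 - 4A^2 \bigr)^l \triangleq \mu_l
\end{aligned} \tag{231}$$

and therefore, from (230) and (231), for every $l \in \mathbb{N}$

$$\gamma_l \triangleq \frac{\mu_l}{d^l} = \bigl[ 1 - 2\alpha(1 - \alpha) \bigr] \left[ 1 + (-1)^l \left( \frac{1 - 2\alpha(1 - \alpha)}{2 \alpha (1 - \alpha)} \right)^{l-1} \right]. \tag{232}$$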



Let us now rely on the two achievable rates for random coding in Eqs. (227) and (229), and apply them to the
binary-input AWGN channel. Due to the channel symmetry, the considered input distribution is symmetric
(i.e., $\alpha = \frac{1}{2}$ and $P = \left( \frac{1}{2}, \frac{1}{2} \right)$). In this case, we obtain from (230) and (232) that

$$D_v(P) = \mathrm{Var}(U_k) = A^2, \qquad d = 2A^2, \qquad \gamma_l = \frac{1 + (-1)^l}{2}, \quad \forall l \in \mathbb{N}. \tag{233}$$
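The closed form (232) for $\gamma_l$ can be cross-checked by enumerating the four equiprobable-or-not pairs $(U_k, \tilde{U}_k)$ directly. The following minimal sketch does so; the function names and the test values of $\alpha$ are illustrative assumptions.

```python
import math

def gamma_enumerated(l, alpha, A=1.0):
    """gamma_l = E[(Z_k)^l] / d^l, computed by enumerating the pairs (U_k, U~_k)."""
    d = 8 * alpha * (1 - alpha) * A ** 2
    mu = 0.0
    for u, pu in ((A, alpha), (-A, 1 - alpha)):
        for v, pv in ((A, alpha), (-A, 1 - alpha)):
            z = d - (u - v) ** 2          # Z_k = d - (U_k - U~_k)^2
            mu += pu * pv * z ** l
    return mu / d ** l

def gamma_closed_form(l, alpha):
    """Closed-form expression (232)."""
    p = 2 * alpha * (1 - alpha)           # probability that U_k and U~_k differ
    return (1 - p) * (1 + (-1) ** l * ((1 - p) / p) ** (l - 1))

if __name__ == "__main__":
    for alpha in (0.2, 0.5):
        print(alpha, [(gamma_enumerated(l, alpha), gamma_closed_form(l, alpha))
                      for l in range(2, 7)])
```

For $\alpha = \frac{1}{2}$ both functions return 1 for even $l$ and 0 for odd $l$, in agreement with (233).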



Based on the first bounding technique that leads to the achievable rate in Eq. (227), since the first condition
in this equation cannot hold for the set of parameters in (233), the achievable rate in this equation is
equal to

$$R_1(\sigma_\nu^2) = \frac{A^2}{4 \sigma_\nu^2} - \ln \cosh\left( \frac{A^2}{4 \sigma_\nu^2} \right)$$
in units of nats per channel use. Let $\mathrm{SNR} \triangleq \frac{A^2}{\sigma_\nu^2}$ designate the signal to noise ratio, then the first achievable
rate gets the form

$$R_1'(\mathrm{SNR}) = \frac{\mathrm{SNR}}{4} - \ln \cosh\left( \frac{\mathrm{SNR}}{4} \right). \tag{234}$$

It is observed here that the optimal value of $\rho$ in (227) is equal to 1 (i.e., $\rho^* = 1$).

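Let us compare it in the following with the achievable rate that follows from (229). Let $m \in \mathbb{N}$ be an even
number. Since, from (233), $\gamma_l = 1$ for all even values of $l \in \mathbb{N}$ and $\gamma_l = 0$ for all odd values of $l \in \mathbb{N}$, then

$$\begin{aligned}
& 1 + \sum_{l=2}^{m-1} \frac{\gamma_l - \gamma_m}{l!} \left( \frac{\rho d}{8 \sigma_\nu^2} \right)^l + \gamma_m \left( \exp\left( \frac{\rho d}{8 \sigma_\nu^2} \right) - 1 - \frac{\rho d}{8 \sigma_\nu^2} \right) \\
&\quad = 1 - \sum_{l=1}^{\frac{m}{2} - 1} \frac{1}{(2l+1)!} \left( \frac{\rho d}{8 \sigma_\nu^2} \right)^{2l+1} + \left( \exp\left( \frac{\rho d}{8 \sigma_\nu^2} \right) - 1 - \frac{\rho d}{8 \sigma_\nu^2} \right).
\end{aligned} \tag{235}$$

Since the sum $\sum_{l=1}^{\frac{m}{2} - 1} \frac{1}{(2l+1)!} \left( \frac{\rho d}{8 \sigma_\nu^2} \right)^{2l+1}$ is monotonically increasing with $m$ (where $m$ is even and
$\rho \in [0, 1]$), then from (229), the best achievable rate within this form is obtained in the limit where $m$ is even
and $m \to \infty$. In this asymptotic case one gets

$$\begin{aligned}
& \lim_{m \to \infty} \left( 1 + \sum_{l=2}^{m-1} \frac{\gamma_l - \gamma_m}{l!} \left( \frac{\rho d}{8 \sigma_\nu^2} \right)^l + \gamma_m \left( \exp\left( \frac{\rho d}{8 \sigma_\nu^2} \right) - 1 - \frac{\rho d}{8 \sigma_\nu^2} \right) \right) \\
&\quad \stackrel{(a)}{=} 1 - \sum_{l=1}^{\infty} \frac{1}{(2l+1)!} \left( \frac{\rho d}{8 \sigma_\nu^2} \right)^{2l+1} + \left( \exp\left( \frac{\rho d}{8 \sigma_\nu^2} \right) - 1 - \frac{\rho d}{8 \sigma_\nu^2} \right) \\
&\quad \stackrel{(b)}{=} 1 - \left( \sinh\left( \frac{\rho d}{8 \sigma_\nu^2} \right) - \frac{\rho d}{8 \sigma_\nu^2} \right) + \left( \exp\left( \frac{\rho d}{8 \sigma_\nu^2} \right) - 1 - \frac{\rho d}{8 \sigma_\nu^2} \right) \\
&\quad \stackrel{(c)}{=} \cosh\left( \frac{\rho d}{8 \sigma_\nu^2} \right)
\end{aligned} \tag{236}$$

where equality (a) follows from (235), equality (b) holds since $\sinh(x) = \sum_{l=0}^{\infty} \frac{x^{2l+1}}{(2l+1)!}$ for $x \in \mathbb{R}$, and
equality (c) holds since $\sinh(x) + \cosh(x) = \exp(x)$. Therefore, the achievable rate in (229) gives (from
(233), $\frac{d}{8 \sigma_\nu^2} = \frac{A^2}{4 \sigma_\nu^2}$)

$$R_2(\sigma_\nu^2) = \max_{\rho \in [0,1]} \left( \frac{\rho A^2}{4 \sigma_\nu^2} - \ln \cosh\left( \frac{\rho A^2}{4 \sigma_\nu^2} \right) \right).$$

Since the function $f(x) \triangleq x - \ln \cosh(x)$ for $x \in \mathbb{R}$ is monotonically increasing (note that $f'(x) = 1 - \tanh(x) >
0$), the optimal value of $\rho \in [0, 1]$ is equal to 1, and therefore the best achievable rate that follows from
the second bounding technique in Eq. (229) is equal to

$$R_2(\sigma_\nu^2) = \frac{A^2}{4 \sigma_\nu^2} - \ln \cosh\left( \frac{A^2}{4 \sigma_\nu^2} \right)$$

in units of nats per channel use, and it is obtained in the asymptotic case where we let the even number
$m$ tend to infinity. Finally, setting $\mathrm{SNR} = \frac{A^2}{\sigma_\nu^2}$ gives the achievable rate in (234), so the first and second
achievable rates for the binary-input AWGN channel coincide, i.e.,

$$R_1'(\mathrm{SNR}) = R_2'(\mathrm{SNR}) = \frac{\mathrm{SNR}}{4} - \ln \cosh\left( \frac{\mathrm{SNR}}{4} \right). \tag{237}$$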



Note that this common rate tends to zero as we let the signal to noise ratio tend to zero, and it tends to In 2 
nats per channel use (i.e., 1 bit per channel use) as we let the signal to noise ratio tend to infinity. 
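
The two limits stated above can be checked numerically. The following is a minimal sketch, assuming that the common rate in (237) has the form SNR - ln cosh(SNR) (the form suggested by the function f(x) = x - ln cosh(x) used above); the function names `log_cosh` and `common_rate` are ours and serve only for illustration.

```python
import math


def log_cosh(x: float) -> float:
    """Numerically stable ln(cosh(x)) = |x| + ln(1 + exp(-2|x|)) - ln 2."""
    ax = abs(x)
    return ax + math.log1p(math.exp(-2.0 * ax)) - math.log(2.0)


def common_rate(snr: float) -> float:
    """Assumed form of the rate in (237): SNR - ln cosh(SNR), in nats per channel use."""
    return snr - log_cosh(snr)


if __name__ == "__main__":
    for snr in (0.0, 0.1, 1.0, 5.0, 50.0):
        print(f"SNR = {snr:5.1f}   rate = {common_rate(snr):.6f} nats per channel use")
```

The stable form of ln cosh avoids overflow of cosh for large arguments, which matters when the high-SNR limit ln 2 is examined.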
In the considered setting of random coding, in order to exemplify the tightness of the achievable rate in
(237), it is compared in the following with the symmetric i.i.d. mutual information of the binary-input AWGN
channel. The mutual information for this channel (in units of nats per channel use) is given by (see, e.g., [61,
Example 4.38 on p. 194])

\[ C(\mathrm{SNR}) = \ln 2 + (2\,\mathrm{SNR} - 1)\, Q\bigl(\sqrt{\mathrm{SNR}}\bigr) - \sqrt{\frac{2\,\mathrm{SNR}}{\pi}} \, \exp\left(-\frac{\mathrm{SNR}}{2}\right) + \sum_{i=1}^{\infty} \frac{(-1)^i}{i(i+1)} \cdot \exp\bigl(2i(i+1)\,\mathrm{SNR}\bigr) \, Q\bigl((1+2i)\sqrt{\mathrm{SNR}}\bigr) \tag{238} \]

where the Q-function that appears in the infinite series on the right-hand side of (238) is the complementary
Gaussian cumulative distribution function in (218). Furthermore, this infinite series converges fast:
the absolute value of its $n$-th remainder is bounded by the $(n+1)$-th term of the series, which scales
like $\frac{1}{n^3}$ (due to a basic theorem on infinite series of the form $\sum_{n \in \mathbb{N}} (-1)^n a_n$ where $\{a_n\}$ is a positive and
monotonically decreasing sequence; the theorem states that the $n$-th remainder of the series is upper bounded
in absolute value by $a_{n+1}$).
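
The remainder bound for alternating series that is invoked above can be illustrated on a simpler example. The sketch below uses the series $\sum_{k \geq 1} (-1)^{k+1}/k = \ln 2$ purely as a stand-in (it is not the series in (238)); the helper names are hypothetical.

```python
import math


def partial_sum(n_terms: int) -> float:
    """Sum of the first n_terms of the alternating harmonic series."""
    return sum((-1) ** (k + 1) / k for k in range(1, n_terms + 1))


def remainder_and_bound(n_terms: int) -> tuple[float, float]:
    """Return (|remainder after n_terms|, magnitude of the first omitted term)."""
    remainder = abs(math.log(2.0) - partial_sum(n_terms))
    bound = 1.0 / (n_terms + 1)
    return remainder, bound


if __name__ == "__main__":
    for n in (1, 5, 50, 500):
        r, b = remainder_and_bound(n)
        print(f"n = {n:4d}   |remainder| = {r:.6f}   first omitted term = {b:.6f}")
```

For the series in (238) the omitted term decays much faster than in this stand-in, so far fewer terms are needed for a given accuracy.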

The comparison between the mutual information of the binary-input AWGN channel with a symmetric i.i.d. 
input distribution and the common achievable rate in (237) that follows from the martingale approach is 
shown in Figure 4. 




Fig. 4. A comparison between the symmetric i.i.d. mutual information of the binary-input AWGN channel (solid line) and the common 
achievable rate in (237) (dashed line) that follows from the martingale approach in this sub-section. 



From the discussion in this sub-section, the first and second bounding techniques in Section VI-J2 lead to
the same achievable rate (see (237)) in the setup of random coding and ML decoding where we assume a
symmetric input distribution (i.e., $P(\pm A) = \frac{1}{2}$). But this is due to the fact that, from (233), the sequence
$\{\gamma_l\}_{l \geq 2}$ is equal to zero for odd indices of $l$ and it is equal to 1 for even values of $l$ (see the derivation of
(235) and (236)). Note, however, that the second bounding technique may provide tighter bounds than the
first one (which follows from Bennett's inequality) due to the knowledge of $\{\gamma_l\}$ for $l > 2$. This approach
was exemplified in Table I in the context of the pairwise error probability (under ML decoding) for some
binary-input discrete memoryless channels.

2) Nonlinear Channels with Memory - Third-Order Volterra Channels: The channel model is first presented in
the following (see Figure 5). We refer in the following to a discrete-time channel model of nonlinear Volterra
channels where the input-output channel model is given by

\[ y_i = [Du]_i + \nu_i \tag{239} \]



DRAFT. LAST UPDATED: OCTOBER 28, 2012 



TABLE III

Kernels of the 3rd order Volterra system $D_1$ with memory 2

kernel   $h_1(0)$   $h_1(1)$   $h_1(2)$   $h_2(0,0)$   $h_2(1,1)$   $h_2(0,1)$
value    1.0        0.5        -0.8       1.0          -0.3         0.6

kernel   $h_3(0,0,0)$   $h_3(1,1,1)$   $h_3(0,0,1)$   $h_3(0,1,1)$
value    1.0            -0.5           1.2            0.8

kernel   $h_3(0,1,2)$
value    0.6



where $i$ is the time index. Volterra's operator $D$ of order $L$ and memory $q$ is given by

\[ [Du]_i = h_0 + \sum_{j=1}^{L} \sum_{i_1=0}^{q} \cdots \sum_{i_j=0}^{q} h_j(i_1, \ldots, i_j) \, u_{i-i_1} \cdots u_{i-i_j} \tag{240} \]

and $\nu$ is an additive Gaussian noise vector with i.i.d. components $\nu_i \sim \mathcal{N}(0, \sigma_\nu^2)$.

Fig. 5. The discrete-time Volterra non-linear channel model in Eqs. (239) and (240) where the channel input and output are $\{U_i\}$ and
$\{Y_i\}$, respectively, and the additive noise samples $\{\nu_i\}$, which are added to the distorted input, are i.i.d. with zero mean and variance $\sigma_\nu^2$.
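
The channel model in (239) and (240) can be sketched in a few lines. The following is only an illustration under stated assumptions: the kernel values are those listed in Table III, every kernel index tuple that is not listed there is taken to be zero (including permutations such as (1, 0)), the constant term $h_0$ is set to zero since its value is not given here, and input samples with negative time index are taken as zero. The names `KERNELS`, `H0`, `volterra_output` and `noisy_channel` are hypothetical.

```python
import random

# Kernels h_j(i_1, ..., i_j), keyed by their delay tuple (values from Table III).
# Assumption: every index tuple that is not listed has a zero kernel.
KERNELS = {
    (0,): 1.0, (1,): 0.5, (2,): -0.8,
    (0, 0): 1.0, (1, 1): -0.3, (0, 1): 0.6,
    (0, 0, 0): 1.0, (1, 1, 1): -0.5, (0, 0, 1): 1.2, (0, 1, 1): 0.8,
    (0, 1, 2): 0.6,
}
H0 = 0.0  # assumption: the constant term h_0 is not specified in this section


def volterra_output(u, i, kernels=KERNELS, h0=H0):
    """Noiseless output [Du]_i; inputs with a negative time index count as zero."""
    total = h0
    for delays, h in kernels.items():
        term = h
        for delay in delays:
            k = i - delay
            term *= u[k] if k >= 0 else 0.0
        total += term
    return total


def noisy_channel(u, sigma, rng):
    """Channel outputs y_i = [Du]_i + nu_i with i.i.d. N(0, sigma^2) noise."""
    return [volterra_output(u, i) + rng.gauss(0.0, sigma) for i in range(len(u))]


if __name__ == "__main__":
    rng = random.Random(0)
    amplitude = 1.0
    u = [amplitude * rng.choice((-1.0, 1.0)) for _ in range(8)]
    print(noisy_channel(u, sigma=1.0, rng=rng))
```

Iterating only over the listed kernels is equivalent to the full multiple sum in (240) when all other kernels vanish, and it keeps the cost per output sample proportional to the number of non-zero kernels.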

Under the same setup of the previous subsection regarding the channel input characteristics, we consider
next the transmission of information over the Volterra system $D_1$ of order $L = 3$ and memory $q = 2$,
whose kernels are listed in Table III. Such system models are used in the base-band representation of
nonlinear narrow-band communication channels. Due to the complexity of the channel model, the calculation of
the achievable rates provided earlier in this sub-section requires the numerical calculation of the parameters
$d$ and $\sigma^2$, and thus of $\gamma_2$, for the martingale $\{Z_i, \mathcal{F}_i\}$. To this end, we have to calculate
$|Z_i - Z_{i-1}|$ and $\mathrm{Var}(Z_i \,|\, \mathcal{F}_{i-1})$ for all possible combinations of the input samples which contribute to the
aforementioned expressions. Thus, the analytic calculation of $d$ and $\gamma_l$ becomes more involved as the system's memory $q$
increases. Numerical results are provided in Figure 6 for the case where $\sigma_\nu^2 = 1$. The new achievable rates
$R_1(D_1, A, \sigma_\nu^2)$ and $R_2^{(2)}(D_1, A, \sigma_\nu^2)$, which depend on the channel input parameter $A$, are compared to the
achievable rate provided in [81, Fig. 2] and are shown to be larger than the latter.

To conclude, improvements of the achievable rates in the low SNR regime are expected to be obtained via existing
improvements to Bennett's inequality (see [26] and [27]), combined with a possible tightening of the union bound
under ML decoding (see, e.g., [63]). This direction of research is studied in [69].

VII. Summary and Outlook 

This section provides a short summary of this work, followed by a discussion on some directions for further 
research. 



Fig. 6. Comparison of the achievable rates in this sub-section, $R_1(D_1, A, \sigma_\nu^2)$ and $R_2^{(m)}(D_1, A, \sigma_\nu^2)$ (where $m = 2$), with the bound
$R_p(D_1, A, \sigma_\nu^2)$ of [81, Fig. 2] for the nonlinear channel with kernels depicted in Table III and noise variance $\sigma_\nu^2 = 1$. Rates are expressed
in nats per channel use.

A. Summary

This chapter derives some classical concentration inequalities for discrete-parameter martingales with uniformly
bounded jumps, and it considers some of their applications in information theory and related topics. The first
part is focused on the derivation of these refined inequalities, followed by a discussion on their relations to some
classical results in probability theory. Along this discussion, these inequalities are linked to the method of types,
martingale central limit theorem, law of iterated logarithm, moderate deviations principle, and to some reported
concentration inequalities from the literature. The second part of this work exemplifies these refined inequalities in
the context of hypothesis testing and information theory, communication, and coding theory. The interconnections
between the concentration inequalities that are analyzed in the first part of this work (including some geometric
interpretation w.r.t. some of these inequalities) are studied, and the conclusions of this study serve for the discussion
on information-theoretic aspects related to these concentration inequalities in the second part of this work. Rather
than covering a large number of applications, we chose to exemplify the use of the concentration inequalities by
considering several applications carefully, which also provide some insight on these concentration inequalities.
Several more applications and information-theoretic aspects are outlined briefly in the next sub-section, as a
continuation to this work. It is aimed to stimulate the use of the martingale approach for establishing concentration in
information and communication-theoretic aspects.

B. Topics for Further Research 

We gather here what we consider to be the most interesting directions for future work as a follow-up to the 
discussion in this chapter. 

• Possible refinements of Theorem 3: The proof of the concentration inequality in Theorem 3 relies on Bennett's
inequality (31). This inequality is applied to a martingale-difference sequence where it is assumed that the
jumps of the martingale are uniformly upper bounded, and a global upper bound on their conditional variances
is available (see (32)). As was noted in [10, p. 44] with respect to the derivation of Bennett's inequality:
"The above analysis may be extended when more information about the distribution of the component random
variables is available." Hence, in the context of the proof of Theorem 3, consider a martingale-difference
sequence $\{\xi_k, \mathcal{F}_k\}_{k=0}^{n}$ where, e.g., $\xi_k$ is conditionally symmetrically distributed around zero given $\mathcal{F}_{k-1}$ (for
$k = 1, \ldots, n$). This additional property makes it possible to obtain a tightened version of Bennett's inequality, and
accordingly to improve the exponent of the concentration inequality in Theorem 3 under such an assumption.
This direction has been recently studied in [68], and it calls for suitable applications.

• Channel polarization: Channel polarization was recently introduced by Arikan [5] to develop a channel coding
scheme, called polar codes, that was demonstrated to be a capacity-achieving coding scheme for memoryless
symmetric channels under sequential decoding, and with a feasible encoding and decoding complexity. The
fundamental concept of channel polarization was introduced in [5, Theorem 1], and it was proved via the
convergence theorem for martingales. This analysis was strengthened in [6] where the key to this analysis
is in [6, Observation 1]; it stated that the random processes that keep track of the mutual information and
Bhattacharyya parameter arising in the course of the channel polarization are, respectively, a martingale and
a super-martingale. Since both random processes are bounded (so they fit the setting in Theorem 5), it is of
interest to consider the applicability of concentration inequalities for refining the martingale-based analysis of
channel polarization for finite block-lengths. A martingale approach to optimize the kernel of polar codes for
$q$-ary input channels (where $q$ is a prime number) has been studied in [2] by maximizing the spread of the polar
martingale for additive-noise channels. It shows that over GF($q$), for $q > 2$ that is prime, the martingale spread
can be significantly increased as compared to the original kernel in [5], leading in some cases to remarkable
improvements in the performance of polar codes even with small to moderate block lengths. The study in [2]
stimulates further research on optimizing the polar kernels by following the martingale approach,
and possibly some concentration inequalities introduced in our work.

• Message-passing decoding for graph-based codes: The concentration inequalities which have been proved in 
the setting of iterative message-passing decoding so far rely on Azuma's inequality. They are rather loose, 
and much stronger concentration phenomena are observed in practice for moderate to large block lengths. 
Therefore, to date, these concentration inequalities serve mostly to justify theoretically the ensemble approach, 
but they are not tight bounds for finite block lengths. It is of interest to apply martingale-based concentration 
inequalities, which improve the exponent of Azuma's inequality, to obtain better concentration results. To this 
end, one needs to tackle the problem of evaluating (or efficiently bounding) the conditional variance of the 
related martingales. Some results in this direction are presented in [24] by refining the proper constants that
follow from the Azuma-Hoeffding inequality for the studied applications.

• Martingale-based Inequalities Related to Exponential Bounds on Error Probability with Feedback: As a follow- 
up to [55, Section 3.3] and [58, Theorem 11], an analysis that relies on the refined versions of Azuma's 
inequality in Section IV (with the standard adaptation of these inequalities to sub-martingales) has the potential 
to provide further results in this direction. 



Appendix A 
Proof of Lemma 3 

The first and third properties of $\varphi_m$ follow from the power series expansion of the exponential function where

\[ \varphi_m(y) = \frac{m!}{y^m} \sum_{l=m}^{\infty} \frac{y^l}{l!} = \sum_{l=0}^{\infty} \frac{m! \, y^l}{(m+l)!}. \]

From its absolute convergence, $\lim_{y \to 0} \varphi_m(y) = 1$, and it follows from the above power series expansion that
$\varphi_m$ is strictly monotonic increasing over the interval $[0, \infty)$. The fourth property of $\varphi_m$ holds since

\[ \varphi_m(y) = \frac{m!}{y^m} \cdot R_{m-1}(y) \]

where $R_{m-1}$ is the remainder of the Taylor approximation of order $m - 1$ for the exponential function $f(y) = e^y$.
Hence, for every $y < 0$,

\[ \varphi_m(y) = f^{(m)}(\xi) = e^{\xi} \]

for some $\xi \in [y, 0]$, so $0 < \varphi_m(y) < 1$. The second property of $\varphi_m$ follows by combining the third and fourth
properties.
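
The properties used in this proof can be sanity-checked numerically. The following is a minimal sketch that evaluates $\varphi_m$ through the power series $\sum_{l \geq 0} m! \, y^l / (m+l)!$ (which avoids the division by $y^m$ near zero); the function name `phi` and the truncation length are our choices and not part of the original analysis.

```python
import math


def phi(m: int, y: float, terms: int = 60) -> float:
    """Truncated series sum_{l>=0} m! y^l / (m+l)!, adequate for moderate |y|."""
    total = 0.0
    term = 1.0  # the l = 0 term: m! / m! = 1
    for l in range(terms):
        total += term
        term *= y / (m + l + 1)  # ratio of consecutive terms
    return total


if __name__ == "__main__":
    for m in (2, 3, 4):
        values = [round(phi(m, y), 6) for y in (-5.0, -1.0, 0.0, 1.0, 2.0)]
        print(m, values)
```

For $m = 2$ and $y = 1$ the closed form $\frac{m!}{y^m}\bigl(e^y - 1 - y\bigr)$ gives $2(e - 2)$, which the series reproduces.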



Appendix B 
Proof of Corollary 4 

The proof of Corollary 4 is based on the specialization of Theorem 4 for $m = 2$. This gives that, for every
$\alpha \geq 0$, the following concentration inequality holds:

\[ \mathbb{P}(|X_n - X_0| \geq n\alpha) \leq 2 \left\{ \inf_{x \geq 0} e^{-\delta x} \left[ 1 + \gamma (e^x - 1 - x) \right] \right\}^n \tag{241} \]

where $\gamma = \gamma_2$ according to the notation in (29).

By differentiating the logarithm of the right-hand side of (241) w.r.t. $x$ (where $x \geq 0$) and setting this derivative
to zero, it follows that

\[ \frac{1 - \gamma x}{\gamma (e^x - 1)} = \frac{1 - \delta}{\delta}. \tag{242} \]
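Let us first consider the case where $\delta = 1$. In this case, this equation is satisfied either if $x = \frac{1}{\gamma}$ or in the limit
where $x \to \infty$. In the former case where $x = \frac{1}{\gamma}$, the resulting bound in (241) is equal to

\[ \exp\left( -n \left[ \frac{1}{\gamma} - \ln\left( \gamma \left( e^{\frac{1}{\gamma}} - 1 \right) \right) \right] \right). \tag{243} \]

In the latter case where $x \to \infty$, the resulting bound in (241) when $\delta = 1$ is equal to

\[ \lim_{x \to \infty} e^{-nx} \left( 1 + \gamma (e^x - 1 - x) \right)^n = \lim_{x \to \infty} \left( e^{-x} + \gamma \left( 1 - (1 + x) e^{-x} \right) \right)^n = \gamma^n. \]

Hence, since for $\gamma \in (0, 1)$

\[ \ln\left( \frac{1}{\gamma} \right) = \frac{1}{\gamma} - \ln\left( \gamma e^{\frac{1}{\gamma}} \right) < \frac{1}{\gamma} - \ln\left( \gamma \left( e^{\frac{1}{\gamma}} - 1 \right) \right) \]

then the optimized value is $x = \frac{1}{\gamma}$, and the resulting bound in the case where $\delta = 1$ is equal to (243).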

Let us consider now the case where $0 < \delta < 1$ (the case where $\delta = 0$ is trivial). In the following lemma, the
existence and uniqueness of a solution of this equation is assured, and a closed-form expression for this solution
is provided.

Lemma 6: If $\delta \in (0, 1)$, then equation (242) has a unique solution, and it lies in $(0, \frac{1}{\gamma})$. This solution is given
in (61).

Proof: Consider equation (242), and note that the right-hand side of this equation is positive for $\delta \in (0, 1)$.
The function

\[ t(x) \triangleq \frac{1 - \gamma x}{\gamma (e^x - 1)} \]

on the left-hand side of (242) is negative for $x < 0$ and $x > \frac{1}{\gamma}$. Since the function $t$ is continuous on the interval
$(0, \frac{1}{\gamma}]$ and

\[ t\left( \frac{1}{\gamma} \right) = 0, \quad \lim_{x \to 0^+} t(x) = +\infty \]

then there is a solution $x \in (0, \frac{1}{\gamma})$. Moreover, the function $t$ is monotonic decreasing in the interval $(0, \frac{1}{\gamma}]$ (the
numerator of $t$ is monotonic decreasing and the denominator of $t$ is monotonic increasing, and both are non-negative in
this interval). This implies the existence and uniqueness of the solution, which lies in the interval $(0, \frac{1}{\gamma})$. In the
following, a closed-form expression of this solution is derived. Note that Eq. (242) can be expressed in the form

\[ \frac{a - x}{e^x - 1} = b \tag{244} \]

where

\[ a \triangleq \frac{1}{\gamma}, \quad b \triangleq \frac{1}{\delta} - 1 \tag{245} \]

are both positive. The substitution $u = a + b - x$ in (244) gives

\[ u e^u = b e^{a+b} \]

whose solution is, by definition, given by $u = W_0(b e^{a+b})$ where $W_0$ denotes the principal branch of the multi-valued
Lambert W function [17]. Since $a, b > 0$ then $b e^{a+b} > 0$, so that the principal branch of $W$ is the only one
which is a real number. In the following, it will be confirmed that the selection of this branch also implies that
$x > 0$ as required. By the inverse transformation one gets

\[ x = a + b - u = a + b - W_0\left( b e^{a+b} \right). \tag{246} \]

Hence, the selection of this branch for $W$ indeed ensures that $x$ is the positive solution we are looking for (since
$a, b > 0$, then it readily follows from the definition of the Lambert W function that $W_0(b e^{a+b}) < a + b$, and it was
earlier proved in this appendix that the positive solution $x$ of (242) is unique). Finally, the substitution of (245)
into (246) gives (61). This completes the proof of Lemma 6. ■
The bound in (241) is given by

\[ \mathbb{P}(|X_n - X_0| \geq \alpha n) \leq 2 \exp\left( -n \left[ \delta x - \ln\left( 1 + \gamma (e^x - 1 - x) \right) \right] \right) \tag{247} \]

with the value of $x$ in (61).
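
The closed-form solution obtained in this appendix can be verified numerically. The sketch below is an illustration only: it assumes that (242) reads $(1 - \gamma x) / (\gamma (e^x - 1)) = (1 - \delta)/\delta$, and it evaluates the principal branch of the Lambert W function with a plain Newton iteration so that no external library is needed. The names `lambert_w0`, `optimal_x` and `stationarity_gap` are hypothetical.

```python
import math


def lambert_w0(z: float, tol: float = 1e-14, max_iter: int = 100) -> float:
    """Principal branch W_0(z) for z >= 0, via Newton's method on w * exp(w) = z."""
    if z < 0.0:
        raise ValueError("this sketch only handles z >= 0")
    w = math.log1p(z)  # starting point at or above the root when z >= 0
    for _ in range(max_iter):
        ew = math.exp(w)
        step = (w * ew - z) / (ew * (w + 1.0))
        w -= step
        if abs(step) <= tol * (1.0 + abs(w)):
            break
    return w


def optimal_x(gamma: float, delta: float) -> float:
    """x = a + b - W_0(b e^{a+b}) with a = 1/gamma and b = 1/delta - 1, for 0 < delta < 1."""
    a = 1.0 / gamma
    b = 1.0 / delta - 1.0
    return a + b - lambert_w0(b * math.exp(a + b))


def stationarity_gap(x: float, gamma: float, delta: float) -> float:
    """Left-hand side minus right-hand side of the assumed form of (242)."""
    return (1.0 - gamma * x) / (gamma * math.expm1(x)) - (1.0 - delta) / delta


if __name__ == "__main__":
    for gamma, delta in ((0.25, 0.5), (0.5, 0.2), (1.0, 0.9)):
        x = optimal_x(gamma, delta)
        print(gamma, delta, x, stationarity_gap(x, gamma, delta))
```

For very small $\gamma$ the argument $b e^{a+b}$ may overflow in floating point; a production implementation would work with logarithms in that regime.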



Appendix C 
Proof of Proposition 1 

Let us compare the right-hand sides of (49) and (241) that refer to Corollaries 2 and 4, respectively. Proposition 1
follows by showing that if $\gamma \leq \frac{1}{2}$

\[ 1 + \gamma(\exp(x) - 1 - x) < \cosh(x), \quad \forall \, x > 0. \tag{248} \]

To this end, define

\[ f(x) \triangleq \cosh(x) - 1 - \gamma(\exp(x) - 1 - x), \quad \forall \, x \geq 0. \]

If $\gamma \leq \frac{1}{2}$, then for every $x > 0$

\[ f'(x) = \sinh(x) - \gamma(\exp(x) - 1) = \left( \frac{1}{2} - \gamma \right) \exp(x) + \gamma - \frac{\exp(-x)}{2} > \left( \frac{1}{2} - \gamma \right) + \gamma - \frac{1}{2} = 0 \]

so, since $f$ is monotonic increasing on $[0, \infty)$ and $f(0) = 0$, then $f(x) > 0$ for every $x > 0$. This validates (248),
and it therefore completes the proof of Proposition 1.
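
A quick numerical look at inequality (248) may help to see the role of the condition on $\gamma$. The sketch below assumes the inequality reads $1 + \gamma(e^x - 1 - x) < \cosh(x)$ for $x > 0$; the function name `gap_248` is ours.

```python
import math


def gap_248(x: float, gamma: float) -> float:
    """cosh(x) - [1 + gamma (e^x - 1 - x)]; positive whenever (248) holds at x."""
    return math.cosh(x) - 1.0 - gamma * (math.expm1(x) - x)


if __name__ == "__main__":
    for gamma in (0.1, 0.3, 0.5, 0.9):
        print(gamma, [round(gap_248(x, gamma), 6) for x in (0.01, 0.5, 1.0, 5.0)])
```

For $\gamma = 0.9$ the gap becomes negative at moderate values of $x$, which shows that the inequality can fail when $\gamma$ exceeds one half.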

Appendix D 
Proof of Proposition 2 

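Lemma 7: For every $\gamma, x > 0$

\[ \frac{\gamma e^x + e^{-\gamma x}}{1 + \gamma} < 1 + \gamma (e^x - 1 - x). \tag{249} \]

Proof: Let $\gamma$ be an arbitrary positive number, and define the function

\[ f_\gamma(x) \triangleq \frac{\gamma e^x + e^{-\gamma x}}{1 + \gamma} - \left[ 1 + \gamma (e^x - 1 - x) \right], \quad x \geq 0. \]

Then, $f_\gamma(0) = 0$, and the first derivative is equal to

\[ f_\gamma'(x) = -\gamma \left( \frac{\gamma e^x + e^{-\gamma x}}{1 + \gamma} - 1 \right). \]

From the convexity of the exponential function $y(u) = e^u$, then for every $x > 0$

\[ \frac{\gamma e^x + e^{-\gamma x}}{1 + \gamma} = \left( \frac{\gamma}{1 + \gamma} \right) y(x) + \left( \frac{1}{1 + \gamma} \right) y(-\gamma x) > y\left( \frac{\gamma}{1 + \gamma} \cdot x + \frac{1}{1 + \gamma} \cdot (-\gamma x) \right) = y(0) = 1 \]

so, it follows that $f_\gamma'(x) < 0$ for every $x > 0$. Since $f_\gamma(0) = 0$ and the first derivative is negative over $(0, \infty)$, then
$f_\gamma(x) < 0$ for every $x > 0$. This completes the proof of inequality (249). ■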
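
Inequality (249) can also be checked on a small grid. This is only a sketch, assuming the left-hand side of (249) is $(\gamma e^x + e^{-\gamma x}) / (1 + \gamma)$ as in the convexity argument of the proof; the names `lhs_249` and `rhs_249` are hypothetical.

```python
import math


def lhs_249(x: float, gamma: float) -> float:
    """(gamma e^x + e^{-gamma x}) / (1 + gamma)."""
    return (gamma * math.exp(x) + math.exp(-gamma * x)) / (1.0 + gamma)


def rhs_249(x: float, gamma: float) -> float:
    """1 + gamma (e^x - 1 - x)."""
    return 1.0 + gamma * (math.expm1(x) - x)


if __name__ == "__main__":
    for gamma in (0.1, 1.0, 3.0):
        for x in (0.1, 1.0, 5.0):
            print(gamma, x, lhs_249(x, gamma), rhs_249(x, gamma))
```

At $x = 0$ both sides equal one, in agreement with the remark that (249) turns into an equality there.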
The claim in Proposition 2 follows directly from Lemma 7, and the two inequalities in (33) and (59) with $m = 2$.
In the case where $m = 2$, the right-hand side of (59) is equal to

\[ \left( 1 + \gamma \left( e^{td} - 1 - td \right) \right)^n. \]

Note that (33) and (59) with $m = 2$ were used to derive, respectively, Theorem 3 and Corollary 4 (based on
Chernoff's bound). The conclusion follows by substituting $x \triangleq td$ on the right-hand sides of (33) and (59) with
$m = 2$ (so that $x \geq 0$ since $t \geq 0$ and $d > 0$, and (249) turns from an inequality if $x > 0$ into an equality if $x = 0$).

Appendix E 
Proof of Corollary 6 

A minimization of the logarithm of the exponential bound on the right-hand side of (63) gives the equation 

U (*-!)! +7m( } 

and after standard algebraic operations, it gives the equation 

+ £{Rr 1 -— (H]tt}-i = o. 

As we have seen in the proof of Corollary 4 (see Appendix B), the solution of this equation can be expressed in
closed form for $m = 2$, but in general, a closed-form solution to this equation is not available. A sub-optimal
value of $x$ on the right-hand side of (53) is obtained by neglecting the sum that appears in the second line of this
equation (the rationale for this approximation is that $\{\gamma_l\}$ was observed to converge very fast, so it was verified
numerically that $\gamma_l$ stays almost constant starting from a small value of $l$). Note that the operation of $\inf_{x \geq 0}$ can
be loosened by taking an arbitrary non-negative value of $x$; hence, in particular, $x$ will be chosen in the following
to satisfy the equation



7 ^--lj(e*-l-*) + 5 .. 

By dividing both sides of the equation by $\gamma_2$, it gives the equation $a + b - cx = b e^x$ with $a$, $b$ and $c$ from (65).
This equation can be written in the form

\[ \left( \frac{a + b}{c} - x \right) e^{-x} = \frac{b}{c}. \]

Substituting $u \triangleq \frac{a + b}{c} - x$ gives the equation

\[ u e^u = \frac{b}{c} \cdot e^{\frac{a + b}{c}} \]

whose solution is given by

\[ u = W_0\left( \frac{b}{c} \cdot e^{\frac{a + b}{c}} \right) \]

where $W_0$ denotes the principal branch of the Lambert W function [17]. The inverse transformation back to $x$ gives
that

\[ x = \frac{a + b}{c} - W_0\left( \frac{b}{c} \cdot e^{\frac{a + b}{c}} \right). \]

This justifies the choice of $x$ in (64), and it provides a loosening of either Theorem 4 or Corollary 5 by replacing
the operation of the infimum over the non-negative values of $x$ on the right-hand side of (53) with the value of $x$
that is given in (64) and (65). For $m = 2$ where the sum on the left-hand side of (250) that was later neglected is
anyway zero, this forms indeed the exact optimal value of $x$ (so that it coincides with Eq. (61) in Corollary 4).
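Appendix F
Proof of Proposition 4

Let $\{X_k, \mathcal{F}_k\}_{k=0}^{\infty}$ be a discrete-parameter martingale. We prove in the following that Theorems 3 and 4, and also
Corollaries 3 and 4 imply (69). For the sake of brevity, we introduce in the following the analysis that is related
to Theorem 3. The others are technical as well.

Let $\{X_k, \mathcal{F}_k\}_{k=0}^{\infty}$ be a discrete-parameter martingale that satisfies the conditions in Theorem 3. From (28)

\[ \mathbb{P}(|X_n - X_0| \geq \alpha \sqrt{n}) \leq 2 \exp\left( -n \, D\left( \frac{\delta' + \gamma}{1 + \gamma} \, \Big\| \, \frac{\gamma}{1 + \gamma} \right) \right) \tag{251} \]

where from (29)

\[ \delta' \triangleq \frac{\alpha / \sqrt{n}}{d} = \frac{\delta}{\sqrt{n}}. \tag{252} \]

From the right-hand side of (251)

\[ D\left( \frac{\delta' + \gamma}{1 + \gamma} \, \Big\| \, \frac{\gamma}{1 + \gamma} \right) = \frac{\gamma}{1 + \gamma} \left[ \left( 1 + \frac{\delta}{\gamma \sqrt{n}} \right) \ln\left( 1 + \frac{\delta}{\gamma \sqrt{n}} \right) + \frac{1}{\gamma} \left( 1 - \frac{\delta}{\sqrt{n}} \right) \ln\left( 1 - \frac{\delta}{\sqrt{n}} \right) \right]. \tag{253} \]

From the equality

\[ (1 + u) \ln(1 + u) = u + \sum_{k=2}^{\infty} \frac{(-u)^k}{k(k-1)}, \quad -1 < u \leq 1 \]

then it follows from (253) that for every $n > \frac{\delta^2}{\gamma^2}$

\[ n \, D\left( \frac{\delta' + \gamma}{1 + \gamma} \, \Big\| \, \frac{\gamma}{1 + \gamma} \right) = \frac{\delta^2}{2\gamma} - \frac{\delta^3 (1 - \gamma)}{6 \gamma^2} \cdot \frac{1}{\sqrt{n}} + \ldots \]

Substituting this into the exponent on the right-hand side of (251) gives (69).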
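
The expansion used in this appendix can be compared with a direct evaluation of the binary divergence. The sketch below is an illustration under the assumption that the exponent is $n \, D\bigl(\frac{\delta' + \gamma}{1 + \gamma} \,\|\, \frac{\gamma}{1 + \gamma}\bigr)$ with $\delta' = \delta / \sqrt{n}$; the helper names and the parameter values are ours.

```python
import math


def kl_bernoulli(p: float, q: float) -> float:
    """Binary divergence D(p || q) in nats, for 0 < p < 1 and 0 < q < 1."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def scaled_exponent(n: int, delta: float, gamma: float) -> float:
    """n * D((delta' + gamma)/(1 + gamma) || gamma/(1 + gamma)) with delta' = delta / sqrt(n)."""
    d_prime = delta / math.sqrt(n)
    return n * kl_bernoulli((d_prime + gamma) / (1.0 + gamma), gamma / (1.0 + gamma))


def two_term_expansion(n: int, delta: float, gamma: float) -> float:
    """First two terms of the expansion: delta^2/(2 gamma) - delta^3 (1 - gamma)/(6 gamma^2 sqrt(n))."""
    return delta ** 2 / (2.0 * gamma) - delta ** 3 * (1.0 - gamma) / (6.0 * gamma ** 2 * math.sqrt(n))


if __name__ == "__main__":
    delta, gamma = 0.5, 0.25
    for n in (10 ** 2, 10 ** 4, 10 ** 6):
        print(n, scaled_exponent(n, delta, gamma), two_term_expansion(n, delta, gamma))
```

As $n$ grows, both quantities approach $\delta^2 / (2\gamma)$, which is the constant that appears in the exponent after the substitution.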

Appendix G 

Analysis Related to the Moderate Deviations Principle in Section V-C 

It is demonstrated in the following that, in contrast to Azuma's inequality, both Theorems 3 and 4 provide upper
bounds on

\[ \mathbb{P}\left( \Bigl| \sum_{i=1}^{n} X_i \Bigr| \geq \alpha n^{\eta} \right), \quad \forall \, \alpha \geq 0 \]

which coincide with the exact asymptotic limit in (82). It is proved under the further assumption that there exists
some constant $d > 0$ such that $|X_k| \leq d$ a.s. for every $k \in \mathbb{N}$. Let us define the martingale sequence $\{S_k, \mathcal{F}_k\}_{k=0}^{n}$
where

\[ S_k \triangleq \sum_{i=1}^{k} X_i, \quad \mathcal{F}_k \triangleq \sigma(X_1, \ldots, X_k) \]

for every $k \in \{1, \ldots, n\}$ with $S_0 = 0$ and $\mathcal{F}_0 = \{\emptyset, \Omega\}$.

1) Analysis related to Azuma's inequality: The martingale sequence $\{S_k, \mathcal{F}_k\}_{k=0}^{n}$ has uniformly bounded jumps,
where $|S_k - S_{k-1}| = |X_k| \leq d$ a.s. for every $k \in \{1, \ldots, n\}$. Hence it follows from Azuma's inequality that, for
every $\alpha \geq 0$,

\[ \mathbb{P}(|S_n| \geq \alpha n^{\eta}) \leq 2 \exp\left( -\frac{\alpha^2 n^{2\eta - 1}}{2 d^2} \right) \]

and therefore

\[ \lim_{n \to \infty} n^{1 - 2\eta} \ln \mathbb{P}(|S_n| \geq \alpha n^{\eta}) \leq -\frac{\alpha^2}{2 d^2}. \tag{254} \]

This differs from the limit in (82) where $\sigma^2$ is replaced by $d^2$, so Azuma's inequality does not provide the asymptotic
limit in (82) (unless $\sigma^2 = d^2$, i.e., $|X_k| = d$ a.s. for every $k$).
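2) Analysis related to Theorem 3: The analysis here is a slight modification of the analysis in Appendix F with
the required adaptation of the calculations for $\eta \in (\frac{1}{2}, 1)$. It follows from Theorem 3 that, for every $\alpha \geq 0$,

\[ \mathbb{P}(|S_n| \geq \alpha n^{\eta}) \leq 2 \exp\left( -n \, D\left( \frac{\delta' + \gamma}{1 + \gamma} \, \Big\| \, \frac{\gamma}{1 + \gamma} \right) \right) \]

where $\gamma$ is introduced in (29), and $\delta'$ in (252) is replaced with

\[ \delta' \triangleq \frac{\alpha \, n^{-(1 - \eta)}}{d} = \delta \, n^{-(1 - \eta)} \tag{255} \]

due to the definition of $\delta$ in (29). Following the same analysis as in Appendix F, it follows that for every sufficiently large $n \in \mathbb{N}$

\[ \mathbb{P}(|S_n| \geq \alpha n^{\eta}) \leq 2 \exp\left( -\frac{\delta^2 n^{2\eta - 1}}{2\gamma} \left[ 1 - \frac{\alpha (1 - \gamma)}{3 \gamma d} \cdot n^{-(1 - \eta)} + \ldots \right] \right) \]

and therefore (since, from (29), $\frac{\delta^2}{\gamma} = \frac{\alpha^2}{\sigma^2}$)

\[ \lim_{n \to \infty} n^{1 - 2\eta} \ln \mathbb{P}(|S_n| \geq \alpha n^{\eta}) \leq -\frac{\alpha^2}{2 \sigma^2}. \tag{256} \]

Hence, this upper bound coincides with the exact asymptotic result in (82). It can be shown that the same conclusion
also follows from Theorem 4.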
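
The difference between the two asymptotic exponents can be made concrete with numbers. The following sketch assumes $\gamma = \sigma^2 / d^2$ and $\delta = \alpha / d$ as in (29), and it evaluates $n^{1 - 2\eta}$ times the exponent of the divergence-based bound at a finite $n$; the parameter values and function names are arbitrary choices made for illustration.

```python
import math


def azuma_limit(alpha: float, d: float) -> float:
    """Asymptotic exponent from Azuma's inequality: -alpha^2 / (2 d^2)."""
    return -alpha ** 2 / (2.0 * d ** 2)


def refined_limit(alpha: float, sigma: float) -> float:
    """Asymptotic exponent from the divergence-based bound: -alpha^2 / (2 sigma^2)."""
    return -alpha ** 2 / (2.0 * sigma ** 2)


def refined_scaled_exponent(n: int, eta: float, alpha: float, d: float, sigma: float) -> float:
    """n^{1-2 eta} * (-n D(.||.)) with delta' = (alpha/d) n^{-(1-eta)} and gamma = sigma^2/d^2."""
    gamma = sigma ** 2 / d ** 2
    d_prime = (alpha / d) * n ** (-(1.0 - eta))
    p = (d_prime + gamma) / (1.0 + gamma)
    q = gamma / (1.0 + gamma)
    divergence = p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    return -(n ** (2.0 - 2.0 * eta)) * divergence


if __name__ == "__main__":
    alpha, d, sigma, eta = 1.0, 2.0, 1.0, 0.75
    print("Azuma limit:  ", azuma_limit(alpha, d))
    print("refined limit:", refined_limit(alpha, sigma))
    for n in (10 ** 4, 10 ** 6, 10 ** 8):
        print(n, refined_scaled_exponent(n, eta, alpha, d, sigma))
```

With $\sigma < d$, the refined exponent is strictly more negative than the one from Azuma's inequality, which is the gap discussed in this appendix.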



Appendix H
Proof of Proposition 5

The proof of (144) is based on calculus, and it is similar to the proof of the limit in (143) that relates the
divergence and Fisher information. For the proof of (146), note that

\[ C(P_\theta, P_{\theta'}) \geq E_L(P_\theta, P_{\theta'}) \geq \min_{i=1,2} \left\{ \frac{\delta_i^2}{2 \gamma_i} - \frac{\delta_i^3}{6 \gamma_i^2 (1 + \gamma_i)} \right\}. \tag{257} \]

The left-hand side of (257) holds since $E_L$ is a lower bound on the error exponent, and the exact value of this
error exponent is the Chernoff information. The right-hand side of (257) follows from Lemma 4 (see (141)) and
the definition of $E_L$ in (145). By definition $\gamma_i \triangleq \frac{\sigma_i^2}{d_i^2}$ and $\delta_i \triangleq \frac{\varepsilon_i}{d_i}$ where, based on (131),

\[ \varepsilon_1 \triangleq D(P_\theta \| P_{\theta'}), \quad \varepsilon_2 \triangleq D(P_{\theta'} \| P_\theta). \tag{258} \]

The term on the left-hand side of (257) therefore satisfies

5? S? 



2 7i 6 7 2 (l + 7 
el eh 



2oj 6a 2 (a 2 + d 2 ) 



so it follows from (257) and the last inequality that 

C(P e , P e .) > E L (P 9l P e ,) > min (259) 

Based on the continuity assumption of the indexed family $\{P_\theta\}_{\theta \in \Theta}$, it follows from (258) that

\[ \lim_{\theta' \to \theta} \varepsilon_i = 0, \quad \forall \, i \in \{1, 2\} \]

and also, from (112) and (122) with $P_1$ and $P_2$ replaced by $P_\theta$ and $P_{\theta'}$ respectively,

\[ \lim_{\theta' \to \theta} d_i = 0, \quad \forall \, i \in \{1, 2\}. \]

It therefore follows from (144) and (259) that

\[ \frac{J(\theta)}{8} \geq \lim_{\theta' \to \theta} \frac{E_L(P_\theta, P_{\theta'})}{(\theta - \theta')^2} \geq \lim_{\theta' \to \theta} \min_{i=1,2} \left\{ \frac{\varepsilon_i^2}{2 \sigma_i^2 (\theta - \theta')^2} \right\}. \tag{260} \]

The idea is to show that the limit on the right-hand side of this inequality is $\frac{J(\theta)}{8}$ (same as the left-hand side), and
hence, the limit of the middle term is also $\frac{J(\theta)}{8}$.



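\[
\begin{aligned}
\lim_{\theta' \to \theta} \frac{\varepsilon_1^2}{2 \sigma_1^2 (\theta - \theta')^2}
&\overset{(a)}{=} \lim_{\theta' \to \theta} \frac{D(P_\theta \| P_{\theta'})^2}{2 \sigma_1^2 (\theta - \theta')^2} \\
&\overset{(b)}{=} \frac{J(\theta)}{4} \lim_{\theta' \to \theta} \frac{D(P_\theta \| P_{\theta'})}{\sigma_1^2} \\
&\overset{(c)}{=} \frac{J(\theta)}{4} \lim_{\theta' \to \theta} \frac{D(P_\theta \| P_{\theta'})}{\sum_{x \in \mathcal{X}} P_\theta(x) \left( \ln \frac{P_\theta(x)}{P_{\theta'}(x)} - D(P_\theta \| P_{\theta'}) \right)^2} \\
&\overset{(d)}{=} \frac{J(\theta)}{4} \lim_{\theta' \to \theta} \frac{D(P_\theta \| P_{\theta'})}{\sum_{x \in \mathcal{X}} P_\theta(x) \left( \ln \frac{P_\theta(x)}{P_{\theta'}(x)} \right)^2 - D(P_\theta \| P_{\theta'})^2} \\
&\overset{(e)}{=} \frac{J(\theta)^2}{8} \lim_{\theta' \to \theta} \frac{(\theta - \theta')^2}{\sum_{x \in \mathcal{X}} P_\theta(x) \left( \ln \frac{P_\theta(x)}{P_{\theta'}(x)} \right)^2 - D(P_\theta \| P_{\theta'})^2} \\
&\overset{(f)}{=} \frac{J(\theta)^2}{8} \lim_{\theta' \to \theta} \frac{(\theta - \theta')^2}{\sum_{x \in \mathcal{X}} P_\theta(x) \left( \ln \frac{P_\theta(x)}{P_{\theta'}(x)} \right)^2} \\
&\overset{(g)}{=} \frac{J(\theta)}{8}
\end{aligned}
\tag{261}
\]

where equality (a) follows from (258), equalities (b), (e) and (f) follow from (143), equality (c) follows from (113)
with $P_1 = P_\theta$ and $P_2 = P_{\theta'}$, equality (d) follows from the definition of the divergence, and equality (g) follows
by calculus (the required limit is calculated by using L'Hôpital's rule twice) and from the definition of Fisher
information in (142). Similarly, also

\[ \lim_{\theta' \to \theta} \frac{\varepsilon_2^2}{2 \sigma_2^2 (\theta - \theta')^2} = \frac{J(\theta)}{8} \]

so

\[ \lim_{\theta' \to \theta} \min_{i=1,2} \left\{ \frac{\varepsilon_i^2}{2 \sigma_i^2 (\theta - \theta')^2} \right\} = \frac{J(\theta)}{8}. \]

Hence, it follows from (260) that $\lim_{\theta' \to \theta} \frac{E_L(P_\theta, P_{\theta'})}{(\theta - \theta')^2} = \frac{J(\theta)}{8}$. This completes the proof of (146).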
We prove now equation (148). From (112), (122), (131) and (147) then 



with e\ and £2 in (258). Hence, 



E L (P e ,P e ,) e\ 
(9>-9) 2 ~ e™82d 2 {9> -9f 
and from (261) and the last inequality, it follows that

E L (P e ,P e/ ) 



< 



(a) 



lim — 7T 



m _ 

8 e'^Xe d\ 

J{9) Y^Pe{*){^%$-D{P \\P»)j 
lim 



8 0'->0 



In 



D(P e \\P e , 



(262) 



It is clear that the second term on the right-hand side of (262) is bounded between zero and one (if the limit
exists). This limit can be made arbitrarily small, i.e., there exists an indexed family of probability mass functions
$\{P_\theta\}_{\theta \in \Theta}$ for which the second term on the right-hand side of (262) can be made arbitrarily close to zero. For a
concrete example, let $\alpha \in (0, 1)$ be fixed, and $\theta \in \mathbb{R}^+$ be a parameter that defines the following indexed family of
probability mass functions over the ternary alphabet $\mathcal{X} = \{0, 1, 2\}$:

\[ P_\theta(0) = \frac{\theta (1 - \alpha)}{1 + \theta}, \quad P_\theta(1) = \alpha, \quad P_\theta(2) = \frac{1 - \alpha}{1 + \theta}. \]

Then, it follows by calculus that for this indexed family 



lim 



max^gA" 



In 



Pe(x) 



D{P e \\P e ,) 



l + i 



(l-a)6 



so, for any $\theta \in \mathbb{R}^+$, the above limit can be made arbitrarily close to zero by choosing $\alpha$ close enough to 1. This
completes the proof of (148), and also the proof of Proposition 5.
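
The limits that relate the divergence and the Chernoff information to the Fisher information can be illustrated numerically on a ternary family. The sketch below is hedged in two ways: it assumes that the family is $P_\theta = \bigl(\frac{\theta(1-\alpha)}{1+\theta}, \, \alpha, \, \frac{1-\alpha}{1+\theta}\bigr)$, and it uses the Fisher information $J(\theta) = \frac{1 - \alpha}{\theta (1 + \theta)^2}$ that we computed by hand for this assumed family. All function names are hypothetical.

```python
import math


def family(theta: float, alpha: float) -> list[float]:
    """Assumed ternary family over {0, 1, 2}; the three masses sum to one."""
    return [theta * (1.0 - alpha) / (1.0 + theta), alpha, (1.0 - alpha) / (1.0 + theta)]


def divergence(p: list[float], q: list[float]) -> float:
    """Relative entropy D(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def chernoff_information(p: list[float], q: list[float], iters: int = 100) -> float:
    """-min over lam in [0, 1] of ln sum_x p(x)^lam q(x)^(1-lam), by ternary search."""
    def log_sum(lam: float) -> float:
        return math.log(sum(pi ** lam * qi ** (1.0 - lam) for pi, qi in zip(p, q)))

    lo, hi = 0.0, 1.0
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if log_sum(m1) < log_sum(m2):
            hi = m2
        else:
            lo = m1
    return -log_sum(0.5 * (lo + hi))


def fisher_information(theta: float, alpha: float) -> float:
    """J(theta) = (1 - alpha) / (theta (1 + theta)^2) for the assumed family."""
    return (1.0 - alpha) / (theta * (1.0 + theta) ** 2)


if __name__ == "__main__":
    theta, alpha, step = 1.0, 0.5, 0.01
    p, q = family(theta, alpha), family(theta + step, alpha)
    j = fisher_information(theta, alpha)
    print("D / step^2 =", divergence(p, q) / step ** 2, "  J / 2 =", j / 2.0)
    print("C / step^2 =", chernoff_information(p, q) / step ** 2, "  J / 8 =", j / 8.0)
```

The ternary search relies on the convexity of the logarithm of the sum in the tilting parameter, so it needs no derivative information.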

Appendix I 
Proof of Lemma 5 

In order to prove Lemma 5, one needs to show that if $\rho'(1) < \infty$ then



lim X^ + l) 2 ^ 



i=i 



ho 



1 - Ci 







(263) 



which then yields from (183) that $B \to \infty$ in the limit where $C \to 1$.

By the assumption in Lemma 5 where $\rho'(1) < \infty$, then $\sum_{i=1}^{\infty} i \rho_i < \infty$, and therefore it follows from the
Cauchy-Schwarz inequality that



> 



1 



i=l 



■ — v^oo 
1 Ei=l l Pi 



> 0. 



Hence, the average degree of the parity-check nodes is finite 

1 



EOO £i_ 
i=l i 



< OO. 



The infinite sum X)*=i(* + l) 2 ^ converges under the above assumption since 

oo 

E^+i) 2 ^ 

1=1 

oo oo 



1=1 



^zpi + 2 ] + 1 < 00. 
where the last equality holds since

r, = 



So P( x ) dx 



= <T g ( -) , Vi G N. 



The infinite series in (263) therefore converges uniformly for $C \in [0, 1]$; hence, the order of the limit and the infinite
sum can be exchanged. Every term of the infinite series in (263) converges to zero in the limit where $C \to 1$,
hence the limit in (263) is zero. This completes the proof of Lemma 5.



Appendix J 

Proof of the properties in (205) for OFDM signals 

Consider an OFDM signal from Section VI-I. The sequence in (203) is a martingale due to basic properties of
martingales. From (202), for every $i \in \{0, \ldots, n\}$



Y = E 



max \s(t;X , . . . ,X n -i)\ 

0<t<T' 1 



Xq, . . . , 



The conditional expectation for the RV Y{-\ refers to the case where only Xq, . . . , Xj_2 are revealed. Let X' i _ 1 
and Xi-i be independent copies, which are also independent of Xo, . . . , Xj_2,Xj, . . . , X n _i. Then, for every 

1 < i < n, 



Yi-i = E 



E 



max 

0<t<T 



s(t;Xo, . . . ,X\_ x ,Xi, . . . ,X„_i)| 



™&yL T \s(t;X , . . . . . . ,X n _i)| 



Xq, • • • j Xi-2 
Xq, . . . , Xj_2,Xj_i 



Since $|\mathbb{E}(Z)| \leq \mathbb{E}(|Z|)$, then for $i \in \{1, \ldots, n\}$



\U-V\ 



Xq, . . . ,Xi-i 



where 



U = maxjs(t; X ,..., X^ u Xi, X n _i)| 
V = maxJs(i;X , . . . . . . ,X„_i)|. 



From (200) 



|C7- V\ < meuc\s(t;X , . . .,X i - 1 ,X i , . . . ,X n _i) - s(i;X , . . . . . . ,X„_i)| 



= max — = 

o<t<r y^n 



(X ( _ 1 -X;_ 1 )e X p(^) 



By assumption, = |_X"^_ 1 1 = 1, and therefore a.s. 

\Xi-i — < 2 = 



I < 



(264) 



(265) 



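In the following, an upper bound on the conditional variance $\mathrm{Var}(Y_i \,|\, \mathcal{F}_{i-1}) = \mathbb{E}\bigl[ (Y_i - Y_{i-1})^2 \,|\, \mathcal{F}_{i-1} \bigr]$ is obtained.

Since $(\mathbb{E}(Z))^2 \leq \mathbb{E}(Z^2)$ for a real-valued RV $Z$, then from (264) and (265)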

e[(>* - y,-i) 2 < ^ • e xu - xi,\ 2 1 j;] 
where $\mathcal{F}_i$ is the $\sigma$-algebra that is generated by $X_0, \ldots, X_{i-1}$. Due to the symmetry of the PSK constellation, it follows
that



< -E x > ^X^-XlrflTi] 

= — — X'^] 2 | X , . . . , Xi-i] 

= — E\\Xi-i - X-_i| 2 | Xi-i] 
n 1 J 



i r, 

= -E 
n L 

iw — J 

= — y 



Xi-i-X^I 2 !^-! = e« 



Af-1 

e m — e m 



M-l 

V D1 , 

nM 0111 \MJ n 